Introducing TraceView Alerts
February 7, 2012

Filed under: Company News

Modern web systems are complex, and running them is not easy. Even a robust, well-designed system requires constant maintenance and tuning just to make sure it runs smoothly. Here at AppNeta, we’re constantly working to tame that complexity and make running websites simpler, easier, and more intuitive. To that end, we’re excited to announce a new feature: TraceView Alerts.

TraceView Alerts are similar to other monitoring solutions, with one important difference. Instead of keeping track of machine availability, network connectivity, or other low-level metrics, Alerts measure what matters most: your website’s performance. Performance is more than just latency distributions and sparklines; it is a direct measure of how effectively you are serving your content to users. Because of this, your website’s responsiveness is a more relevant measure of availability than availability itself.

Why Performance Monitoring?

Latency Spike Demo

The reason that performance monitoring is so powerful is simple: high latencies are a leading indicator for downtime and other serious operational issues. Before a machine goes down, whether due to a traffic increase, hardware defects, or network contention, it will generally start to fail by responding to existing requests more slowly. For example, a RAID5 array with a bad disk will not stop working; it will simply respond to requests more slowly, because it has to recompute the lost data from the parity blocks.
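To see why a degraded array gets slower rather than failing outright, here is a minimal Python sketch. It assumes a simplified single stripe with three data blocks and one XOR parity block; real RAID5 rotates parity across disks, and the block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus a parity block (illustrative contents).
data = [b"web1", b"web2", b"web3"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails. Every read of that block now
# requires reading all surviving blocks plus parity and XOR-ing them.
rebuilt = xor_blocks([data[0], data[2], parity])
```

The array still returns the right answer, but a read that used to touch one disk now touches all of the others, which is exactly the kind of slowdown that shows up as latency first.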

Unless you’ve had the foresight to set up OS-level monitoring on every critical device in your system, this sort of problem could easily go unnoticed. However, setting up a single top-level Alert can warn you of lurking issues before they become problems.

Performance Monitoring Demo

Slicing and Dicing

Because it’s so important for Alerts to operate on data you already work with, we’ve built them right into our Layer Summary page, at the App level. Apps tend to be pretty coarse-grained, so Alerts also support filters on domains, URLs, controllers, and actions. Once you’ve picked a threshold and verified its behavior, we’ll keep an eye on it. By default, we’ll email you when something goes wrong, and again when the system recovers. You can tweak your Alerts whenever you like.
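The trigger-and-recover behavior can be pictured as a small state machine. The following Python sketch is only an illustration of the idea, not TraceView's implementation: the `ThresholdAlert` class, the `matches` helper, the field names, and the 300 ms threshold are all hypothetical.

```python
from dataclasses import dataclass, field


def matches(trace, filters):
    """Return True if a trace satisfies every filter (e.g. domain, controller)."""
    return all(trace.get(key) == value for key, value in filters.items())


@dataclass
class ThresholdAlert:
    threshold_ms: float
    filters: dict = field(default_factory=dict)
    triggered: bool = False
    notifications: list = field(default_factory=list)

    def observe(self, latency_ms):
        """Record one latency value; notify only when the state changes."""
        if latency_ms > self.threshold_ms and not self.triggered:
            self.triggered = True
            self.notifications.append(("triggered", latency_ms))
        elif latency_ms <= self.threshold_ms and self.triggered:
            self.triggered = False
            self.notifications.append(("recovered", latency_ms))


alert = ThresholdAlert(threshold_ms=300, filters={"controller": "checkout"})
traces = [
    {"controller": "checkout", "latency_ms": 120},
    {"controller": "search", "latency_ms": 900},  # filtered out
    {"controller": "checkout", "latency_ms": 350},
    {"controller": "checkout", "latency_ms": 400},
    {"controller": "checkout", "latency_ms": 200},
]
for trace in traces:
    if matches(trace, alert.filters):
        alert.observe(trace["latency_ms"])
```

Notifying only on state changes means a long incident produces one message when it starts and one when it ends, rather than one per slow request.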

Review and Drilldown

Of course, learning about a problem is only the beginning of fixing it. Back on the Layer Summary page, we’ve added a new section devoted to reviewing past Alerts. It shows only existing Alerts that could be triggered by the data on the page, so you can see your past Alerts right next to the data that caused them. From here, you can drill down to the offending layer, find the machine that’s going bad, and make it right as quickly as possible.

Performance Monitoring Demo 2

A note for the mathematically inclined

As you review past Alerts, you may notice that the data in some charts seems to disagree with when we saw a violation of your Alert. This is because the two are calculated slightly differently. Most data is averaged over the displayed window; for example, points on the day view are 15-minute averages. To provide up-to-the-minute responsiveness, Alerts (including previews) use exponentially weighted moving averages. This (honestly, fairly boring and standard) way of calculating the current value of a quantity smooths over spikes and noise in the data, while still providing a fast response to meaningful changes.
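For the curious, the standard form of an exponentially weighted moving average is s_t = alpha * x_t + (1 - alpha) * s_(t-1), where x_t is the newest sample and alpha controls how quickly old data is forgotten. The Python sketch below compares it with a plain windowed average. The smoothing factor of 0.3, the one-sample-per-minute spacing, and the latency values are assumptions chosen for illustration, not the parameters TraceView uses.

```python
def ewma(values, alpha):
    """Return the exponentially weighted moving average after each sample."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    current = None
    for value in values:
        # Seed with the first sample, then blend each new sample in.
        current = value if current is None else alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def window_average(values, size):
    """Average consecutive, non-overlapping windows of `size` samples."""
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Ten quiet minutes at 100 ms, then five minutes at 500 ms.
latencies = [100] * 10 + [500] * 5

chart_points = window_average(latencies, 15)  # one 15-minute chart point
alert_values = ewma(latencies, alpha=0.3)     # what an alert would track
```

With these made-up numbers, the single 15-minute chart point stays below a 300 ms threshold, while the moving average crosses it two samples into the slowdown. That is the kind of disagreement between chart and Alert described above.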

Get Started!

We’ve already gotten a lot of mileage out of these internally, and we look forward to hearing your feedback. If you think you could benefit from monitoring at a higher level, set up a few Alerts, and let us know what you think!