Filed under: Company News
When monitoring your application and debugging performance problems, it really helps if performance data is organized the same way that your application is. End-user requests are handled by applications composed of processes running on different virtual or physical hosts. Today, TraceView is changing the way we present host monitoring data in order to more accurately reflect the operational experience, grouping host performance metrics by app.
Why Host Metrics?
Traditionally, metrics such as CPU utilization, load, and disk latency were the main tools sysadmins used to understand application performance. Question about performance? SSH in and run top, iostat, and friends, or maybe wait for your Nagios alerts.
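For the curious, that by-hand workflow looks something like this on a typical Linux host (a sketch, assuming standard Linux utilities; iostat ships in the sysstat package, so it may not be installed everywhere):

```shell
# Manual triage after SSH'ing into a host: load, CPU, and disk, eyeballed one at a time.
uptime                                                   # 1/5/15-minute load averages
command -v top >/dev/null && top -bn1 | head -n 5 || true    # one-shot CPU/memory snapshot
command -v iostat >/dev/null && iostat -x 1 2 || true        # per-device disk latency/utilization, if sysstat is present
```

Fine for one box; as the rest of this post argues, it falls over quickly once requests span many hosts and services.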
Unfortunately, these metrics alone are so low-level that they do little to describe the actual performance of your application for end users. Saturated CPUs might be a warning sign in one situation, while in another, it’s idle CPUs that mean your customers are in trouble. As applications become more complex and distributed, looking at isolated metrics isn’t enough to understand how an application is scaling and performing.
Full-stack tracing provides a much more comprehensive picture of application performance, detailing the latency and performance impact of each component across the browser, network, and application. That doesn’t mean host metrics are obsolete, though; quite the opposite. Every user request is powered by hardware at some level, and understanding low-level host health can be key to diagnosing certain types of latency problems. That’s why we’ve always included host monitoring for free with TraceView.
Your Hosts Where They Belong: In Context
This performance data is associated with individual requests, so you can tell whether a particular app server was swamped, or whether a database query was an outlier because of resource contention:
This flips the traditional monitoring cycle upside down. Starting from application health instead of host metrics means fewer, more meaningful alerts, but responding to those issues still frequently means troubleshooting slow disks or overloaded CPUs. Once you know which machine a problematic request ran on, you no longer have to dig through a huge list of metrics across dozens of machines, each potentially running multiple services.
That doesn’t mean it’s not useful to visualize all those machines and their metrics in one place. In fact, it’s frequently quite useful to scan across all machines for utilization hot spots. Combining this approach with high-level application metrics can track down virtually any problem.
The Right Hosts
Looking at everything is viable in smaller deployments, but what happens when the application ecosystem becomes more complex? Monitoring a 2-server PHP + MySQL setup can (paradoxically) use a larger range of tools than a 50-service, 500-host deployment. SSH’ing into your servers to keep them in line simply doesn’t scale much past a single host.
In these more complex environments, even the “look at all the metrics” approach starts to fall apart, since so many of the machines are doing different things. To address that, starting today we’re grouping hosts by the apps and services they belong to by default. This means that all host and JVM performance data will show up in an easily scannable list, right next to the application performance data we’re already collecting.
Check it out, and let us know what you think! If you’re not already getting full-stack tracing to tie your application performance data together, check out a free 14-day trial of TraceView.