As application deployments get more complex, the tools to make sure they’re working have certainly more than kept up. Not only do we have tools to monitor hardware health, but there are tools to measure network performance, application performance, and resource usage for every conceivable piece of every system.
Unfortunately, collecting data for data’s sake doesn’t keep any application running, and those monitoring tools aren’t the application. They’re only a means to an end. The benefit of monitoring is easy to see, but what about the cost? Is it just the time it takes to set the tools up?
Let’s look at the real cost of all those monitoring tools.
An Explosion of Tools
First, let’s get the obvious statement out of the way: tools aren’t free, and having lots of them is expensive. Sure, you don’t have to put Nagios on a credit card, but Linux is only free if your time is worthless. You put time and effort into learning, configuring, and maintaining each tool. Most of the time, this is part of the direct calculus of spinning up a new solution: what insight does this tool bring me, and how long will it take to get running?
No matter how easy it looks, every monitoring tool has a significant setup cost. Sometimes this means buying licenses, sometimes it means setting up a new server to house the data, and sometimes (ugh) it means both. This is a real cost, but it’s only the beginning.
Who Monitors the Monitors?
One of the main problems with the philosophy of multiple overlapping tools is that it leaves all of the problem-solving and information-gathering in the head of the user (you). Any sysadmin’s primary skill, after consuming coffee, is doing this sort of top-to-bottom systems analysis. If it’s so important, though, why not automate it and bring all the information into one place?
Having to use multiple tools to solve a single problem makes understanding the root cause of any issue that much harder. How many hours does it take to make a single-line configuration change? The bottleneck is always understanding the problem, not typing out the change in the terminal. Consider the following 3 scenarios:
- Every couple of weeks, you get email from the Director of Sales that Salesforce.com is intermittently slow.
- About once a week, and always between 3:00 and 3:05, a user from the Fargo branch office tells you that Office365 is slower to load than they expect.
- You get a daily alert that Marketo takes 20+ seconds to load at 9AM, when the sales team is firing up the first daily batch of 17 WebEx sessions.
The first scenario is a nightmare. The second one is intriguing. And the third one? With all the information laid out in front of you, I’ll bet you’ve already figured out how you’re going to fix it.
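The third scenario is easy because the reports already line up by time. As a rough sketch of that idea in Python (the sample reports, the `slow_hours` helper, and the 20-second threshold are all made up for illustration, not the output of any particular tool), you could bucket slow-load reports by application and hour of day and see whether a pattern falls out:

```python
from collections import Counter
from datetime import datetime

# Hypothetical slow-load reports: (timestamp, application, load time in seconds).
reports = [
    (datetime(2024, 3, 4, 9, 1), "Marketo", 22.0),
    (datetime(2024, 3, 5, 9, 2), "Marketo", 24.5),
    (datetime(2024, 3, 6, 9, 0), "Marketo", 21.3),
    (datetime(2024, 3, 6, 14, 30), "Marketo", 3.1),
]

SLOW_SECONDS = 20.0  # assumed threshold, borrowed from the "20+ seconds" scenario


def slow_hours(reports, threshold=SLOW_SECONDS):
    """Count slow loads per (application, hour of day)."""
    counts = Counter()
    for when, app, seconds in reports:
        if seconds >= threshold:
            counts[(app, when.hour)] += 1
    return counts


pattern = slow_hours(reports)
# The busiest (application, hour) bucket is the first place to look.
print(pattern.most_common(1))
```

With the sample data, every slow load lands in the 9AM Marketo bucket, which is the kind of pattern that points you straight at whatever else starts up at that hour.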
Army of One
With any system, complexity is the root of all evil, and monitoring tools are no different. The simpler a tool is to configure and manage, the better. Part of the appeal of SaaS is the drastic simplification in setup and management. For better or worse, one of the main side effects of making tools easier to use is that more people use them. In business-critical apps like Salesforce, more people seeing pipeline forecasts is a Good Thing. In monitoring tools, more people seeing bandwidth or SLA reports for Salesforce.com is also a Good Thing.
The problem with democratizing access to information is that teaching teams of people is more costly than teaching a single individual. Creating complexity around the suite of monitoring tools makes the whole arsenal harder to use for teams. This cost isn’t something that’s typically considered, because no individual person ever has an issue learning an individual tool. Look around; when was the last time somebody asked you for help with a tool they should know how to use? That could be a symptom that there are simply too many monitoring tools, and you’re the only one who knows how to use them all.
All tools come with some history and an agenda. Many tools started from “I found this cool way we could get information about X!”, where X is some arcane protocol or technique, like SNMP or timing ICMP packets. While that information is valuable, it can lead to eye-rolling conversations with people in the rest of the company: “Oh, I entirely understand how the backup RAID array running out of space caused my phone call to drop.” The only thing that matters to the rest of the organization is, does the application I care about actually work?
Tools that do not focus on end user experience cost more to maintain, because somebody has to manually correlate technical issues with end-user issues. Everything from resource consumption to traffic patterns is secondary to “Can I log into Salesforce and look at warm leads?” To monitor effectively, you have to start with what you’re building for, and it always comes back to what the application users want to do. Tools that encourage starting elsewhere hurt the conversation, making it more difficult to understand if things are actually working and fix them when they’re broken.
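To make “start with the user” a little more concrete, here is a minimal Python sketch of a check built around one user-facing action rather than a technical metric. Everything in it is an assumption for illustration: `check_user_action`, `budget_seconds`, and the `load_leads` stand-in are hypothetical names, and a real check would perform an actual login and page load instead of sleeping.

```python
import time


def check_user_action(action, budget_seconds):
    """Run one user-facing action and report whether it worked within budget."""
    start = time.perf_counter()
    try:
        action()
        succeeded = True
    except Exception:
        # Any failure counts as "the user could not do the thing".
        succeeded = False
    elapsed = time.perf_counter() - start
    return {
        "ok": succeeded and elapsed <= budget_seconds,
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
    }


def load_leads():
    # Hypothetical stand-in for "log in and load the leads page".
    time.sleep(0.01)


result = check_user_action(load_leads, budget_seconds=5.0)
print(result["ok"], round(result["elapsed_seconds"], 2))
```

The design choice is that the check answers a single question, “did the thing the user cares about work, and fast enough?”, and anything about resources or traffic only matters once that answer is no.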
Taming the Zoo
Monitoring tools are duct tape. In theory, you don’t need any, because everything works perfectly all the time. Realistically, you do need some. But if you think you have an off-brand of duct tape, maybe you should consider that the solution, for once, might not be more duct tape. As you’re looking at tools, consider solutions that:
- Minimize the setup cost and start providing value as fast as possible,
- Reduce the number of tools you have to use to solve problems,
- Are easy to use, especially as the team grows and institutional knowledge gets distributed, and
- Focus on the end user rather than technical minutiae.
Maintaining monitoring is a drag on focusing on what really matters: the applications and their users. Think about what you’ve already used today. How could you simplify your toolset?