Until November 2012, Headnet relied on “old-school” monitoring tools, such as open-source infrastructure monitoring, to ensure the machines its applications ran on were up to the task. When a problem arose, the team could sometimes fix it by adding RAM or replacing a disk, but more often than not the cause lay in the application itself. In many cases, Headnet engineers were left digging through application logs for days before finding a resolution.
After installing TraceView, not only did the Headnet team find problems they couldn’t see before, but their resolution time dropped dramatically. “I showed TraceView to one of my colleagues, and he was like ‘Wow.’ It took literally minutes to see issues you could improve with it,” said Anton Stonor, Web Technology Lead at Headnet.
One of the first issues identified by Stonor and his team was a sporadic spike in latency. According to the infrastructure monitoring, there was no problem, and even the application itself seemed to be responding quickly. Looking directly at the slow requests revealed that Apache would occasionally conclude the backend was down, and those requests would stall in Apache until it re-established the connection. Stonor changed the Apache configuration to eliminate those slow reconnections, resulting in a much more consistent user experience.
In another Drupal application, customers complained that a particular page took many seconds to load, but only occasionally. Without a reproducible test case, Stonor’s team struggled to find a fix, as the problem could not be recreated in their environments. With TraceView, they immediately saw that the page in question was synchronously running a maintenance job before rendering the HTML. Moving the job into the background cut the load time to consistently less than one second.
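The general pattern behind that fix can be sketched in a few lines. This is an illustration only, written in Python rather than Headnet's Drupal/PHP code, and every name in it (`run_maintenance`, `render_page_sync`, `render_page_deferred`, the `jobs` queue) is hypothetical. It contrasts a handler that runs the job inline with one that hands the job to a background worker and returns right away.

```python
import queue
import threading
import time

# All names here are hypothetical stand-ins, not taken from Headnet's code.
jobs = queue.Queue()   # work handed off by request handlers
completed = []         # records which jobs the worker has finished


def run_maintenance(page_id):
    """Stand-in for a slow maintenance job."""
    time.sleep(0.2)
    completed.append(page_id)


def worker():
    """Drain the queue forever, running one maintenance job at a time."""
    while True:
        page_id = jobs.get()
        try:
            run_maintenance(page_id)
        finally:
            jobs.task_done()


def render_page_sync(page_id):
    """Original shape: the visitor waits for the job before getting HTML."""
    run_maintenance(page_id)
    return f"<html>page {page_id}</html>"


def render_page_deferred(page_id):
    """Reworked shape: enqueue the job and return the HTML immediately."""
    jobs.put(page_id)
    return f"<html>page {page_id}</html>"


# A daemon thread does the deferred work outside the request path.
threading.Thread(target=worker, daemon=True).start()
```

In this sketch the deferred handler's response time no longer depends on how long the job takes. The trade-off is that the page may be rendered before the maintenance work has finished, which is only acceptable when the page does not need the job's result.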
Throughout their projects, Stonor’s team found more subtle issues into which they previously had no visibility. In their SQL databases, not only were there slow queries, but “there would be either way too many queries or [queries] would be fetching too-large data sets. In TraceView, you could see exactly the amount of queries and which one of them was taking a long time, and what code is to blame,” said Stonor.
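To make the “way too many queries” pattern concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how TraceView instruments an application. The `CountingConnection` wrapper, the `authors`/`posts` schema, and both functions are assumptions invented for illustration. The wrapper records each query's SQL and duration, so a per-row lookup loop can be compared with a single join that returns the same data.

```python
import sqlite3
import time


class CountingConnection:
    """Hypothetical wrapper that logs the SQL text and duration of each query."""

    def __init__(self, conn):
        self.conn = conn
        self.log = []  # list of (sql, seconds) tuples

    def query(self, sql, params=()):
        start = time.perf_counter()
        rows = self.conn.execute(sql, params).fetchall()
        self.log.append((sql, time.perf_counter() - start))
        return rows


# Illustrative schema and data: 5 authors with 4 posts each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [(i, f"author{i}") for i in range(1, 6)],
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(i, (i % 5) + 1, f"post{i}") for i in range(1, 21)],
)


def titles_n_plus_one(db):
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors"):
        rows = db.query(
            "SELECT title FROM posts WHERE author_id = ?", (author_id,)
        )
        result[name] = sorted(title for (title,) in rows)
    return result


def titles_joined(db):
    """The same data fetched with a single joined query."""
    grouped = {}
    rows = db.query(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id"
    )
    for name, title in rows:
        grouped.setdefault(name, []).append(title)
    return {name: sorted(titles) for name, titles in grouped.items()}
```

Running both functions through separate `CountingConnection` instances and comparing `len(db.log)` shows the loop version issuing one query per author plus one, while the join issues a single query. The logged durations show which statements account for the time.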
TraceView also illuminated the impact of slow external calls. In one case, an API call to a reverse DNS lookup service was to blame, and Stonor’s team identified the problem in production immediately. “Something that may have taken days, weeks or months to find, we were able to find in just a matter of a few minutes,” said Stonor.
Across Headnet’s projects, TraceView significantly improved the MTTR (Mean-Time-to-Resolution), not to mention the MTTPC (Mean-Time-to-Pretty-Chart). In the first two months with TraceView, Headnet dramatically reduced the average and worst-case load times on a number of sites, improving customer satisfaction and decreasing support incidents – all without taking significant time away from new development.