Debugging After the Engineers Have Left
by Rob Salmond on April 2, 2014
We build tools to serve both halves of the coin from which devops takes its name, but we frequently find other uses for those tools across the company. Recently while working late I received a message from an engineer on the sales team looking for some help with a trial he was running. The client had come to us for the full stack treatment and we intended to deliver, but there was a snag he hadn’t seen before.
Our secret to providing full-stack visibility lies in tightly integrating our toolset to provide a cohesive view. Whether you’re hunting for the hop that’s degrading VoIP quality, or the database query that’s making your check-out page slow every other Wednesday at noon, we believe an end-to-end view is the best way to diagnose issues. One of the product’s key views pulls data from both AppView and TraceView.
On the AppView end, a headless browser runs a Selenium-like script against a given web app from an arbitrary network vantage point (or points), collecting performance data as though it were a user. On the other end, a TraceView-instrumented web app scrutinizes all the code and infrastructure involved in servicing incoming requests, synthetic or otherwise.
By matching synthetic requests from appliances to the corresponding server-side traces which serviced them, our users get a complete view of performance. Obviously, these come from different sources, so they’re stored in different services on the backend. Since the UI ties everything together, it can seem a bit like magic the first time you shine that particular spotlight on your own application.
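To make the matching step concrete, here’s a rough sketch of the idea in Python. This is not TraceView’s actual implementation or schema — the field names (`request_id`, `total_ms`, `duration_ms`) are invented for illustration — just a minimal join of client-side synthetic checks to server-side traces on a shared request ID:

```python
# Hypothetical sketch: correlating synthetic requests with server-side
# traces via a shared request ID. Field names are illustrative only,
# not TraceView's real schema.

def correlate(synthetic_requests, server_traces):
    """Join AppView-style synthetic checks to TraceView-style traces."""
    traces_by_id = {t["request_id"]: t for t in server_traces}
    combined = []
    for req in synthetic_requests:
        trace = traces_by_id.get(req["request_id"])
        if trace is not None:
            combined.append({
                "url": req["url"],
                "client_ms": req["total_ms"],       # browser-observed time
                "server_ms": trace["duration_ms"],  # time spent in the app
                "network_ms": req["total_ms"] - trace["duration_ms"],
            })
    return combined

synthetic = [{"request_id": "abc123", "url": "/checkout", "total_ms": 480}]
traces = [{"request_id": "abc123", "duration_ms": 120}]
print(correlate(synthetic, traces))
```

The payoff of a join like this is exactly the end-to-end split described above: once client and server numbers sit in one record, time lost in the network falls out as the difference.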
That night though, the usually seamless magic wasn’t happening. My HipChat lit up.
(That’s 11pm for my co-worker on the East coast!)
He told me they had a few appliances already running AppView scripts, and I’d personally helped set up TraceView data collection for this company a few days before, so I knew that was working. What was missing was the integration.
Well, that, and, considering the late hour, we were also missing the entire engineering team, including everyone who had designed and implemented the ties that bind our tools together.
Fortunately, we were equipped with a powerful support tool, because of course all of our own web apps are running TraceView. We had everything we needed to investigate despite having no ops folks around to SSH into production, or developers on hand to answer functionality questions.
I logged into TraceView to peer under the hood.
(Pay no attention to the man behind the curtain!)
As a full-time TraceView support engineer I’m quite familiar with the admin dashboard for managing integration with AppView, but only from one side of the equation. Populate a field with the right URL, hit the button and wait for the wheels to turn.
(The screenshots are real but the account names have been changed to protect the innocent.)
Just as my co-worker had said, this account wouldn’t sync, and none of my support teammates were around to explain what was going on from an AppView perspective. So, armed with just an admin control we knew was involved and the TraceView data for that admin page, I punched the button’s endpoint into the URL search field.
(We both took turns mashing the button to no avail.)
Digging into the individual traces for these requests I began stepping through each operation, searching for where things had gone wrong, and soon enough found an RPC call being made from TraceView to AppView that seemed important.
(Rob Salmond? I know that guy!)
When the RPC call returned it was followed by a database update which populated several configuration fields with the unhelpful value of “NULL”. Now I knew this RPC call wasn’t returning anything, but why? I turned my attention to the TraceView dashboard for the servers which power AppView.
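The failure mode above is a classic one, and worth sketching. This is a hypothetical reconstruction, not the actual TraceView/AppView code — every function and field name here is invented — but it shows how an empty RPC response can silently turn into a row full of NULLs:

```python
# Hypothetical sketch of the failure mode: an RPC response that comes
# back empty silently propagates None into every configuration column.
# All names here are invented for illustration.

def build_sync_update(rpc_response):
    """Build UPDATE parameters for the config row from an RPC response dict."""
    # If the response is empty ({} or None), every .get() below yields
    # None, which the database driver dutifully writes out as NULL.
    rpc_response = rpc_response or {}
    return {
        "appview_org_id": rpc_response.get("org_id"),
        "appview_api_key": rpc_response.get("api_key"),
        "sync_endpoint": rpc_response.get("endpoint"),
    }

print(build_sync_update({}))  # every field comes back as None
```

Nothing in this path raises an error: the update “succeeds,” and the broken state only surfaces later, which is exactly why the NULLs in the trace were the first real clue.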
Again I searched for traces, this time using the URL to which the RPC call had been made, and began stepping through operations until I found another span that looked interesting.
This database query took less than a microsecond; it didn’t seem like it could have done much work in that time.
Under the `backtrace(s)` tab associated with the query in the screenshot above was, exactly as intended, a complete call stack of all classes and methods involved in its execution. Poring over the thousands of lines of code inside this app to work this out on my own would have taken ages, but with the relevant field names and classes and some quick grepping the answer was obvious. The fields in question were configurable via an admin page in the AppView back end that I’d never seen before, but which clearly needed attention for this RPC call to succeed.
(Knowing what to look for makes it easier to find!)
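The grep step itself is simple enough to sketch. This is just an in-memory stand-in for `grep -rn` — the file paths and field name below are invented for illustration, not AppView’s real source tree — showing how field names lifted from a backtrace narrow thousands of lines down to a handful:

```python
# Rough sketch of the grep step: given field names lifted from a
# backtrace, find every source line that mentions them. Paths and
# field names are invented for illustration.
import re

def find_references(field_names, files):
    """files: mapping of path -> source text. Returns (path, lineno, line) hits."""
    pattern = re.compile("|".join(re.escape(name) for name in field_names))
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path, lineno, line.strip()))
    return hits

sources = {
    "admin/views.py": "def save_config(form):\n    org.sync_endpoint = form['sync_endpoint']\n",
    "billing/models.py": "class Invoice:\n    pass\n",
}
print(find_references(["sync_endpoint"], sources))
```

On a real checkout the equivalent one-liner would be something like `grep -rn sync_endpoint .` — the point is that a backtrace hands you exactly the search terms you need.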
One quick update and, at the next turn of the wheel, everything slid into place with a satisfying 200 OK. We later found out that while this process is fully automated for new users, existing accounts like this one looking to expand their use of AppNeta tools require some manual intervention, which, if we hadn’t stuck our noses in that night, would have been dealt with first thing in the morning by our provisioning team!
In the end we both went home confident we were ready to make the trial magic happen the following day. Two non-programmers with no bugs filed, no pagers alerted, and no customer impact incurred.
At AppNeta we’re all about empowering developers and system engineers with the tools to write faster code and build better infrastructure, but while we’re at it, why not equip support with deep diagnostic tools as well? Then engineers remain free to do what they do best: build great apps that just fly.