Delivering on the Promise of APM
by June 1, 2012

Filed under: Performance Monitoring

Tracelytics

When AppNeta acquired Tracelytics and their awesome team, we decided to keep their great blog content to document AppNeta’s journey!

Today, June 1, marks the two-year anniversary of Tracelytics.  It’s a tempting opportunity to reminisce, but more importantly it’s also a milestone at which to assess our progress: why did we embark on this journey, what problems were we trying to solve, and how far have we come towards solving them?

Service-Orientation

Prior to starting Tracelytics, I’d been working with Spiros at Songza.com, then called AmieStreet.com (property later acquired by Amazon).  We were fortunate to work with an amazing engineering team solving some really interesting challenges: dynamic real-time pricing, constant ingestion and release of new music, search and recommendation.

More interesting than the product challenges, however, was the architecture that enabled us to solve them. Large portions of the infrastructure were implemented as services loosely coupled to the applications that accessed them using Apache Thrift, an open-source RPC protocol developed originally at Facebook.  We found services with their well-defined interfaces to be maintainable from an engineering perspective and scalable from a provisioning perspective. Additionally, with Thrift’s language-agnostic bindings, we could write each service in the language best suited to its task. So, while the frontend was written in PHP, the search service was Java-based, the pricing engine was written in Erlang, the spelling corrector was written in Python/C, and so on.

Our service-oriented architecture yielded a lot of benefits, yet it also caused its fair share of headaches. I particularly began to notice these on the ops side of my devops role: when parts of the site were performing poorly, how could we diagnose the problem quickly? Is one of the services hung, causing RPC timeouts? Which hosts did this slow request hit during the course of assembling the data on the page? Where are these queries coming from: the ORM in the app, a service, or a service call triggered by a request to the app?

Latency in the APM Market

The challenges that we faced were not unique to our app: almost any modern web app is service-oriented, at least to some degree. A load balancer forwards a request to a webserver, which may serve static content, or pass it along to a dynamic backend that queries databases and caches. Perhaps there are even other web stacks involved via REST calls.

Yet there were no solutions available to help us develop and run the site more efficiently. We had to home-roll tools that would let us get quick scans of the services, and frankly a lot of tailing logs was involved.  So a bit later, after Spiros had moved on from Songza, when he called me and said “I found this great idea to solve performance problems for service-oriented web apps,” I was immediately intrigued.

What was the idea? Using an approach similar to research done as part of the X-Trace project, we would build out instrumentation that could follow requests up and down throughout the stack, gathering timing and event data deep in the application layer and out to service calls.  This data would then be used to generate actionable insights in real-time.
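To make the idea concrete, here is a minimal sketch of request tracing across a service call. It is not Tracelytics’ or X-Trace’s actual implementation: the `Trace` class, the `frontend` and `search_service` functions, and the `X-Trace-Id` and `X-Parent-Span` header names are all hypothetical, and a plain function call stands in for the RPC. It only shows the general pattern of passing a trace ID along with the request and recording a timed event at each layer.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Hypothetical helper: collects timed events that share one trace ID."""

    def __init__(self, trace_id=None):
        # Reuse the caller's trace ID if one was passed in, else start a new trace.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.events = []

    @contextmanager
    def span(self, layer, parent=None):
        """Time the enclosed block and record it as one event."""
        span_id = uuid.uuid4().hex[:8]
        start = time.perf_counter()
        try:
            yield span_id
        finally:
            self.events.append({
                "trace_id": self.trace_id,
                "span_id": span_id,
                "parent": parent,
                "layer": layer,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })


def search_service(headers, sink):
    """Stand-in for a backend service; reads trace context from made-up headers."""
    trace = Trace(headers["X-Trace-Id"])
    with trace.span("search", parent=headers["X-Parent-Span"]):
        time.sleep(0.001)  # simulated work
    sink.extend(trace.events)


def frontend(sink):
    """Stand-in for the web tier; starts the trace and calls the service."""
    trace = Trace()
    with trace.span("frontend") as span_id:
        headers = {"X-Trace-Id": trace.trace_id, "X-Parent-Span": span_id}
        search_service(headers, sink)  # a real system would make an RPC here
    sink.extend(trace.events)
    return trace.trace_id


if __name__ == "__main__":
    collected = []
    frontend(collected)
    for event in collected:
        print(event["layer"], event["parent"], round(event["duration_ms"], 2))
```

Because every event carries the same trace ID and a pointer to its parent, a collector can stitch the events from different hosts back into a single picture of the request.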

Spiros and I teamed up with Chris, then a distributed systems PhD candidate at Brown, to start Tracelytics.  We decided to start by applying to Betaspring, an accelerator in Providence, RI, with the following plan: instrument a bunch of open-source web stack components, build a real-time data processing pipeline, and deliver performance data via a SaaS app.  Would they accept us?

Fortunately we weren’t the only ones who thought this was a good idea. In fact, when we went to talk to the guys at Betaspring, they loved it.  Soon we found out that Google had been developing a similar approach to APM internally, and when they heard about Tracelytics they became one of our first investors.  That winter we took our seed investors’ wisdom and money and put our heads down building out Tracelytics and iterating with our awesome early customers.

Fast Forward Two Years

Since then, we’ve launched, presented at conferences, been joined by an amazing team of smart people, and sped up a lot of web apps.  As for our original goal, we wanted to help developers and operations engineers take advantage of service-oriented architectures, build web apps more efficiently, and solve problems faster.  How’d we do?

Today, our customers can trace incoming requests at any level of detail through their entire stack and view the latency and timing of each part of the request.  Understanding the performance of tiers and services no longer requires a bevy of tools or logfiles.  Anyone can get full-stack insight, making performance analysis and troubleshooting just a few clicks away.  The powerful interactive visualizations, like the heatmap and traceview, provide unparalleled insight into performance, in real time.

Best of all, Tracelytics doesn’t just deliver features: it’s delivering real results.  How do you get a 200x speedup in 24 hours?  Just ask Joe Stump at sprint.ly.  We’ve helped customers find solutions to problems ranging from mundane query optimization to esoteric bugs in mis-implemented caching libraries and problems in routing.  It’s exciting to say that there are bottlenecks that can be revealed with Tracelytics that no other APM product will see, and even more exciting to think about what we have in store.