
TraceView Data API

Announcing the TraceView Data API

Today, I’m excited to announce a new TraceView feature – the Data API!

In a nutshell, the Data API exposes all of those high-level metrics you’re collecting in TraceView over REST, formatted as JSON. Now you can take that data, jam it into your own systems, and make sure everybody in your organization sees what they need, when they need it. We’ve also wrapped our configuration API into the same endpoint, so you can interact with all of our services in one place.
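
To make that concrete, here’s a minimal sketch of pulling one of those series in Python. It hits the same endpoint as the curl example below; the exact shape of the JSON response is an assumption on my part, so inspect it before building anything on top of it.

# A minimal sketch of fetching a metric series from the Data API.
# Assumption: the response is JSON whose exact structure you should
# inspect yourself before parsing it any further.
import os

import requests

API_KEY = os.environ["API_KEY"]
url = "https://api.tracelytics.com/api-v1/latency/Prod_Web/server/series"
resp = requests.get(url, params={"key": API_KEY, "time_window": "week"})
resp.raise_for_status()

print(resp.json())  # dump the raw JSON to see what you're working with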

If you’re itching to get to it, head on over to the docs. If you’re still not convinced, read on! Let’s take a look at what you can do with this.

Latency and Volume

Before you do anything, you need to know how your app is doing right now, and that starts with two questions: how much traffic do I have, and how fast is it? The easy way to look at this is just to plot them:

$ export API_KEY=xxx # For TraceView, tracing TraceView!
$ curl "https://api.tracelytics.com/api-v1/latency/Prod_Web/server/series?key=$API_KEY&time_window=week" | python extract.py

(Check out this gist for the full scripts. I’m using the wonderful matplotlib to generate these plots, by the way. I also recommend gnuplot, if you weren’t poisoned by Matlab in a previous life, like some of us.)
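
If you’d rather see the idea than dig through the gist, here’s roughly what a script like extract.py could look like. This is a sketch, not the script from the gist, and it assumes the API returns JSON with a "data" key holding [timestamp, value] pairs; adjust the parsing to whatever the endpoint actually sends back.

# extract.py (sketch): read the JSON piped in from curl and plot it.
# Assumption: the payload looks like {"data": [[timestamp, value], ...]}.
import json
import sys

import matplotlib.pyplot as plt

payload = json.load(sys.stdin)
points = payload["data"]  # assumed: a list of [timestamp, value] pairs

times = [p[0] for p in points]
values = [p[1] for p in points]

plt.plot(times, values)
plt.xlabel("time")
plt.ylabel("latency")
plt.title("Prod_Web server latency, past week")
plt.show()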

This is actually our backend trace-processing machinery – a RabbitMQ-based system with a bunch of Python Celery workers feeding off of it. You can see the periodic change in volume, as most of our customers are based in the US and Canada.

An interesting bit about this particular data set is that there’s variation in the response times, and it seems related to the amount of traffic we get. Let’s plot those against each other and see what that looks like:
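
Here’s one way to build that comparison, with the same caveats as before: the volume endpoint path is a guess modeled on the latency one, the response shape is assumed, and the Pearson correlation at the end just puts a rough number on what the scatter plot shows.

# Sketch: scatter latency against volume to eyeball the relationship.
# Assumptions: the "volume" metric path mirrors the latency one, and
# the response carries [[timestamp, value], ...] pairs under "data".
import os

import matplotlib.pyplot as plt
import numpy as np
import requests

API_KEY = os.environ["API_KEY"]
BASE = "https://api.tracelytics.com/api-v1"

def fetch_series(metric):
    url = "%s/%s/Prod_Web/server/series" % (BASE, metric)
    resp = requests.get(url, params={"key": API_KEY, "time_window": "week"})
    resp.raise_for_status()
    return dict(resp.json()["data"])  # timestamp -> value

latency = fetch_series("latency")
volume = fetch_series("volume")  # hypothetical metric name

# Align the two series on shared timestamps before comparing them.
shared = sorted(set(latency) & set(volume))
x = [volume[t] for t in shared]
y = [latency[t] for t in shared]

plt.scatter(x, y, s=5)
plt.xlabel("traces per interval")
plt.ylabel("latency")
plt.show()

print("Pearson r = %.2f" % np.corrcoef(x, y)[0, 1])  # 1.0 means perfectly linear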

These look pretty correlated! In an ideal world, we’d see a flat line in latency, no matter the traffic. Instead, the data seems to show that we actually get a bit slower as the number of traces we process increases, which points to some sort of resource contention: CPU, memory, or disk on one of the machines. In our case, the app page tells that story for us.

The only layer that’s actually increasing in time with volume is our Cassandra layer. This is pretty common; most of the components in this system scale horizontally, except for writes to the DB. Even with Cassandra’s stellar write performance, we still see a bit of a slowdown. Time to add more machines to the ring!

TR Jordan: A veteran of MIT’s Lincoln Labs, TR is a reformed physicist and full-stack hacker – for some limited definition of full stack. TR still harbors a not-so-secret love for Matlab-esque graphs and half-baked statistics, as well as elegant and highly performant code.