Announcing the TraceView Data API
Today, I’m excited to announce a new feature for TraceView: the Data API!
In a nutshell, the Data API exposes all of those high-level metrics you’re collecting in TraceView over REST, formatted as JSON. Now you can take that data, jam it into your own system and do whatever you need to make sure everybody in your organization sees what they need, when they need it. We’ve also folded our configuration API into it, so you can interact with all of our services in one place.
If you’re itching to get to it, head on over to the docs. If you’re still not convinced, read on! Let’s take a look at what you can do with this.
Latency and Volume
Before you do anything, you need to know how your app is doing right now, and that starts with two questions. How much traffic do I have, and how fast is it? The easy way to look at this is just to plot them:
$ export API_KEY
# For TraceView, tracing TraceView!
| python extract.py
(Check out this gist for the full scripts. I’m using the wonderful matplotlib to generate these plots, by the way. I also recommend gnuplot, if you weren’t poisoned by Matlab in a previous life, like some of us.)
This is actually our backend trace-processing machinery – a RabbitMQ-based system with a bunch of Python Celery workers feeding off of it. You can see the periodic change in volume, as most of our customers are based in the US and Canada.
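If you'd rather not open the gist, here is a rough sketch of what an extraction step like extract.py might do before plotting. It is not the actual script: the payload shape (a `fields` list plus `items` rows), the field names and the sample values are all assumptions made up for illustration, so check the docs for the real response format.

```python
import json

# Hypothetical payload: the real Data API response may be shaped differently.
SAMPLE = """
{"fields": ["timestamp", "volume", "avg_latency"],
 "items": [[1370000000, 120, 0.041],
           [1370000030, 180, 0.047],
           [1370000060, null, null]]}
"""


def parse_series(payload):
    """Split a JSON payload into parallel timestamp, volume and latency lists."""
    doc = json.loads(payload)
    fields = doc["fields"]
    ts_i = fields.index("timestamp")
    vol_i = fields.index("volume")
    lat_i = fields.index("avg_latency")

    timestamps, volumes, latencies = [], [], []
    for row in doc["items"]:
        # Skip intervals with no data rather than plotting them as zero.
        if row[vol_i] is None or row[lat_i] is None:
            continue
        timestamps.append(row[ts_i])
        volumes.append(row[vol_i])
        latencies.append(row[lat_i])
    return timestamps, volumes, latencies


timestamps, volumes, latencies = parse_series(SAMPLE)
```

In a real pipeline you would swap `SAMPLE` for `sys.stdin.read()` and hand the three lists to matplotlib's `plot`.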
An interesting bit about this particular data set is that there’s variation in the response times, and it seems related to the amount of traffic we get. Let’s plot those against each other, and see what that looks like:
These look pretty correlated! In an ideal world, we’d see a flat line in latency, no matter the traffic. This seems to show that we actually get a bit slower as the number of traces we process increases. This means there’s some sort of resource contention here, either CPU, memory or disk usage on one of the machines. In our case, the app page tells that story for us.
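If eyeballing a scatter plot isn't enough, you can put a number on "pretty correlated". The sketch below computes a Pearson correlation coefficient and a least-squares slope using only the standard library. The sample values are made up for illustration and are not our real measurements; the function names are mine, not part of the Data API.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A perfectly flat series has no variance; report it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def slope(xs, ys):
    """Least-squares slope of ys against xs (change in y per unit of x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


# Made-up sample: traces per interval and average latency in seconds.
volumes = [100, 200, 300, 400, 500]
latencies = [0.040, 0.044, 0.049, 0.052, 0.058]

r = pearson_r(volumes, latencies)
seconds_per_trace = slope(volumes, latencies)
```

A value of `r` close to 1 means latency climbs with volume, and the slope estimates how much latency each extra trace per interval costs. In the ideal flat-line case the slope would be close to zero.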
The only layer that’s actually increasing in time with volume is our Cassandra layer. This is pretty common; most of the components in this system scale horizontally, except for writes to the DB. Even with Cassandra’s stellar write performance, we still see a bit of a slowdown. Time to add more machines to the ring!