Profiling Python Performance Using lineprof, statprof, and cProfile
by May 12, 2012

Filed under: Performance Monitoring

AppNeta no longer blogs on DevOps topics like this one.

Feel free to enjoy it, and check out what we can do for monitoring end user experience of the apps you use to drive your business at


When AppNeta acquired Tracelytics and their awesome team – we decided to keep their great blog content to document AppNeta’s journey!

If you’re a regular here, you know how much we care about the full-stack view of applications. Today, let’s zoom in a bit, and talk about the performance of a single layer. In particular, let’s look at profiling Python performance.

As an example, let’s take this bit of code for calculating confidence intervals of the mean of a set of data. If you’re not familiar with confidence intervals, they provide a set of bounds for a given statistic; a 95% interval implies that the true mean lies in the calculated range 95% of the time. One way of calculating this involves generating a number of new data sets from the data you have (random selection, with replacement), and looking at that (meta?-)data set. In practice, you typically run this a couple of times, look at the results, and see if they’re converging. If not, you keep running until they do converge, or you hit some max number of iterations.

It turns out, doing 1000 iterations of a reasonable data set size (say, 100k points) is a reasonably expensive operation — about 2 seconds (ish) on my laptop. That’s not too bad in isolation, but run every 15 minutes, on a variety of data sizes up to 1 million, over all customers here at Tracelytics? We can do much better.

Since this is Python, let’s see if there’s an obvious way to do it. As it turns out, there is: the standard library module cProfile. This is Python’s flexible, deterministic, C-implemented profiler. (There’s also profile, a pure-python version, but unless you have good cause to avoid C extensions, cProfile is probably a better bet.) cProfile will give us a list of all function calls, sorted however we want. Let’s run it, and dump the result to a file:

# runs cProfile.runctx('main(num_iterations)', globals(), locals(), filename='')
$ python 100000

$ python
>>> import pstats
>>> p = pstats.Stats('')
>>> p.sort_stats('time').print_stats(5)
Wed May 16 19:53:40 2012

12300206 function calls in 14.741 seconds

Ordered by: internal time
List reduced from 17 to 5 due to restriction

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4100000    8.484    0.000    9.057    0.000 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/
4    2.916    0.729   14.398    3.600
4100000    2.625    0.000   11.682    0.000 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/
4100000    0.573    0.000    0.573    0.000 {method 'random' of '_random.Random' objects}
45    0.084    0.002    0.084    0.002 {sum}

There’s a few important things to take note of in these results. The first thing to notice is that 3 of the top 4 functions are in the random module. We only use the random module twice in our example, and the initial use (generating the data set) is a constant 100,000 calls. That leaves 4 million calls from the inner loop of confidence_interval. Since that’s a core part of the algorithm, I wonder if there’s a better way to do that?

There’s a few important things to take note of in these results. The first thing to notice is that 3 of the top 4 functions are in the random module. We only use the random module twice in our example, and the initial use (generating the data set) is a constant 100,000 calls. That leaves 4 million calls from the inner loop of confidence_interval. Since that’s a core part of the algorithm, I wonder if there’s a better way to do that?

For a start, let’s try using a different random number generator. As a rule, if you have a slow function in python, and numpy has a replacement function, the numpy version will be faster. Let’s replace random with numpy.random, and try again:

Python 1

Much better! Unfortunately, now we have a different problem. Instead of a single line or function call, we only know our problem exists inside a function.

One way to drill down on this is to use statprof. This is a different type of profiler: instead of instrumenting each function call, statprof wakes up at pre-defined intervals and records a stack trace, including line number. Let’s run that:

$ python statprof 100000
%   cumulative      self
time    seconds   seconds  name
54.81      1.54      1.54
30.24      0.85      0.85
9.45      0.27      0.27
2.13      0.34      0.06
1.33      0.07      0.04
1.20      0.03      0.03

Cool! Looking back at the code, lines 44 and 45 are the selection and array creation:

44 choices_index = [numpy.random.randint(0, n-1) for _ in xrange(n)]
45 choices = [values[i] for i in choices_index]
46 current_mean = mean(choices)

Interestingly, the absolute time reported by statprof is a bit lower than cProfile. That difference is due to the statistical nature of statprof. Even at the default setting of 1ms samples, it simply has to do less work than cProfile, who has to instrument all 4 million calls to the various parts of random. It’s worth remembering this difference in overhead, as some code will exihibit different behavior as the timing changes.

In any case, can we do better? It’s possible that the time here is no longer being spent on calculations, but instead on simply creating intermediate arrays. Let’s combine those lines, and re-run:

$ python statprof 100000
%   cumulative      self
time    seconds   seconds  name
85.52      1.92      1.92
9.54      0.21      0.21
1.50      0.07      0.03
1.50      0.03      0.03
1.28      0.25      0.03

That certainly seems better: 1.92s vs 2.39s (1.54 + 0.85) from above. At this point, though, be careful. Up until now, we’ve avoided directly comparing profiling runs. Looking at the percentage of time spent, we’re actually at about the same place (85% of total); this run just completed faster than the last run. This is a common pattern — as the low-hanging fruit is picked off, it becomes more and more important to validate smaller gains in more formal ways. How, exactly, to collect those stats is a topic for another post.

It’s also worth mentioning Robert Kern’s line_profiler. It’s a deterministic profiler (like cProfile), but it captures profiling data on a line-by-line basis. It also allows selectively profiling specific functions, which can be invaluable on a larger codebase, to weed out the noise from other modules. Unfortunately, because it does capture every line call you ask it to, the overhead of instrumenting every line you run can be substantial (4x slowdown on my laptop). For completeness, here’s what it looks like on our sample:

Function: confidence_interval at line 33
Total time: 6.65548 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
33                                           @profile
34                                           def confidence_interval(values, iterations=1000, alpha=0.05):
35                                               """Return the confidence interval of the mean."""
37         4            7      1.8      0.0      n = len(values) # Sample size to resample with
38         4         4277   1069.2      0.1      value = mean(values) # Estimator
40                                               # Bootstrap the standard error by drawing a bunch of choices from the given
41                                               # values, then calculating the variance of that synthetic distribution.
42         4            8      2.0      0.0      synthetic_values = []
43        44           82      1.9      0.0      for _ in xrange(iterations):
44   4000040      6574458      1.6     98.8          choices = [values[numpy.random.randint(0, n-1)] for _ in xrange(n)]
45        40        76320   1908.0      1.1          current_mean = mean(choices)
46        40          226      5.7      0.0          synthetic_values.append(current_mean)
48                                               # Compute the confidence interval using pivotal intervals
49         4           22      5.5      0.0      synthetic_values.sort()
50         4           42     10.5      0.0      lower_index = int(math.floor(iterations * ( 1 - alpha / 2 ) ))
51         4            5      1.2      0.0      lower = 2 * value - synthetic_values[lower_index]
52         4           10      2.5      0.0      upper_index = int(math.floor(iterations * alpha / 2 ))
53         4            5      1.2      0.0      upper = 2 * value - synthetic_values[upper_index]
54         4            7      1.8      0.0      return {'lower': lower,
55         4            5      1.2      0.0              'upper': upper,
56         4            4      1.0      0.0              'value': value,
57         4            4      1.0      0.0              'num': len(values)}

Finally, remember that profiling is just one component of a performance toolkit. Most performance problems aren’t complex; they’re just well-hidden. Before writing the above code, I hadn’t realized how expensive random number generation could be. Simply taking the time to look for a faster function saved over 90% of the time spent.

Make it work, then make it pretty, then make it fast. And with the right tools, it doesn’t even take that long.