Making Your Python Faster
Brandon Rohrer
Chapter 1: Can't Artificial Intelligence Already Do That?
Chapter 2: Keeping Time with Python
Chapter 3: Getting Processes to Talk to Each Other
Chapter 4: Making Animations with Matplotlib
Chapter 5: Simulating the Physical World
Chapter 6: Making Your Python Code Run Faster
About This Project
How to Train Your Robot is a side project I've been working
on for 20 years. It's a consistent source of satisfaction. A big
part of the joy is sharing what I learn as I go. And who
knows? Maybe someone will find it useful.
Brandon
Boston, USA
August 12, 2023
Chapter 6
Making Your Python Code Run Faster
The thing is, they aren't totally wrong, but they left out an
important part of the quote: "Premature optimization is the
root of all evil." The trouble with optimization is not that it's
a bad thing. The trouble is that it's so much fun. It's
addictive. It can quite easily become an obsession that
eclipses all other concerns and takes up all the oxygen in the
room.
1. Knuth, "Structured Programming with go to Statements," Computing Surveys 6:4 (December 1974), pp. 261–301, §1. doi:10.1145/356635.356640
Profiling
The best recipe I've seen for writing fast programs is this:

1. Write the program.
2. Watch it run and find the slow parts.
3. Make the slow parts faster.
Step 2 is also called profiling, and the tools that let you watch
programs to find the slow parts are called profilers. There are
several good ones to choose from, including a profiler built
right into Python's standard library called cProfile. My
personal favorite is one called py-spy. Unlike cProfile,
py-spy runs in a separate process, so it won't trip up your
Python process or burden it with overhead. It can trace its
lineage back to the one and only Julia Evans's work on rbspy.
$ ps -a
...
71974 pts/1 00:00:00 python3
71975 pts/1 00:00:27 python3
71976 pts/1 00:00:24 python3
...
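With a process ID from a listing like this, py-spy can attach to the running program from the outside. As a rough sketch (the PID below is one from the listing above; yours will differ, and on some systems you may need elevated permissions to attach):

# Live, top-style view of which functions are using the most time
$ py-spy top --pid 71975

# Record samples for a while and write a flame graph to an SVG file
$ py-spy record -o profile.svg --pid 71975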
The fact that they are still the longest poles in the tent gives a hint as to
just how much of a bottleneck they are. Before optimization
they were many times slower.
Vectorization
There's probably a lot of smart stuff to be said about
vectorization, but it boils down to this: whenever you can
take your math and put it into Numpy arrays, do it. It's so
much faster. Python is good at a great many things, but
quickly iterating over for-loops is not one of them. The folks
who do the under-the-hood optimizations for Numpy arrays
are amazing, and I take advantage of their fine work
wherever I can.
The first case adds two big lists of numbers, element by element, with a plain Python for-loop.

N = 10_000_000  # a big number
A = list(range(N))
B = list(range(N))
C = [0] * N
for i in range(N):
    C[i] = A[i] + B[i]
The second case is adding the same numbers, but in the form
of Numpy arrays. (Actual code in sum_array.py of the
chapter repo.)
A = np.arange(N)
B = np.arange(N)
C = A + B
There's a bit more to the code to make sure that the compiler
doesn't cheat and avoid doing the work, but the snippets
above are the important part. (For a refresher on timing code
in a way that measures what we think we're measuring,
revisit Chapter 2.) For N = 10 million, the for-loop case takes
1300 milliseconds and the Numpy array case takes 18
milliseconds on my machine.
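For reference, one minimal way to wrap a measurement like this, in the spirit of Chapter 2 (the exact timing approach there may differ):

import time

start = time.perf_counter()
C = A + B  # the operation being measured
elapsed_ms = 1000 * (time.perf_counter() - start)
print(f"{elapsed_ms:.1f} ms")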
The exact numbers here aren't the important part. If you run
these scripts, your results may be very different. It is the
nature of optimization to be very specific, and your results
will vary with different data types, array sizes, operating
systems, Numpy versions, and processor types. So the
important takeaway here is not that Numpy is 72 times
faster, but that it can speed things up a lot, where "a lot" will
vary by context.
That's the trickiest part. With the tiled coordinates from both
groups we can get the differences by doing an element-wise
subtraction of one array from the other. We can also do
element-wise squaring, add it to a similarly constructed set
of y-coordinate differences, then take the square root of the
sum to get the distance between every pair of bodies.
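Here is a bare-bones sketch of that idea with made-up coordinates. The variable names are illustrative, not necessarily the ones the simulation code uses:

import numpy as np

# Made-up x- and y-coordinates for two groups of bodies
x_a = np.array([0.0, 1.0, 2.0])
y_a = np.array([0.0, 0.5, 1.0])
x_b = np.array([3.0, 4.0])
y_b = np.array([1.0, 2.0])

# Tile each group's coordinates so that every pairing appears once:
# rows correspond to group a, columns to group b.
x_a_tiled = np.tile(x_a[:, np.newaxis], (1, x_b.size))
x_b_tiled = np.tile(x_b[np.newaxis, :], (x_a.size, 1))
y_a_tiled = np.tile(y_a[:, np.newaxis], (1, y_b.size))
y_b_tiled = np.tile(y_b[np.newaxis, :], (y_a.size, 1))

# Element-wise subtraction, squaring, adding, and square-rooting
# give the distance between every pair of bodies.
dx = x_a_tiled - x_b_tiled
dy = y_a_tiled - y_b_tiled
distance = np.sqrt(dx ** 2 + dy ** 2)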
from numba import njit

@njit
def add(A, B, C):
    # n_sum is the number of elements to add, defined elsewhere
    # in the script.
    for i in range(n_sum):
        C[i] = A[i] + B[i]
    return C
You can see one of Numba's quirks in the way we built the
function. It takes A, B, and C as input arguments, both the
operands and the result. By handing it the return variable as
an argument, we let the function fill in an array that has
already been allocated rather than create a new one on every call.
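As a rough illustration of the calling pattern (the sizes and names here are placeholders):

import numpy as np

n_sum = 10_000_000
A = np.arange(n_sum, dtype=np.float64)
B = np.arange(n_sum, dtype=np.float64)

# Allocate the result array once, up front...
C = np.zeros(n_sum)

# ...then hand it to the jitted function and reuse it on every call.
C = add(A, B, C)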
Another Numba quirk is that the first time through the code
is always slow. When I run the simulation from the previous
chapter it takes several seconds of staring at a black screen
before the blocks render and start bouncing around. When I
turn off Numba, this startup lag goes away entirely
(although my code then runs too slow). Compiling these
functions takes a little bit of time, small though they are.
print("Warming up simulation")
sim.step()
This gets the pre-compilation done and out of the way before
the simulation is committed to keeping up with the wall
clock and saves you from getting a lot of angry warnings
from the Pacemaker.
For-loops. Seriously.
Execution speed is hard to guess beforehand. When we were
experimenting with Numpy arrays at first, we saw a huge
speedup going from a naked for-loop to Numpy-powered
element-wise addition of whole arrays. If we modify our
Numba function to do whole array operations instead of a
for-loop, it actually slows it down considerably, to 42 ms.
@njit
def add(A, B, C):
    C = A + B
    return C
The @njit decorator we've been using is just a shorthand for the longer form

@jit(nopython=True)
Numba doesn't come for free, but it's also not the plague.
Debugging it can be a hassle, but it's a big improvement over
earlier generations of Numba, which just reported that there
was an error somewhere in the function being compiled.
Happy hunting! I want to give a huge shoutout to the people
working on Numba who are putting in the work to make such
an important tool work better. They are doing a really good
job of it.
A_cast = A.astype(np.long)
2. Yes, it's true that you can pass an empty output array to Numpy using the out argument, and it will be populated with the result. This ameliorates the performance hit a little bit. Now stop interrupting me with facts while I'm trying to make a point.
Then after the first two lines of code compile and execute, I
add two more. This way I know that any new errors that
occur are probably due to the most recently added code.
Matrix multiplication
The one apparent exception to all of the above rules is matrix
multiplication, the repeated application of the dot product
across the rows and columns of a pair of two-dimensional
arrays. This particular operation comes up so often in
computationally intense applications that it has become the
standard by which all numerical computation packages are
measured. It's the backbone of deep learning and modern
machine learning methods. It is the problem that an entire
class of silicon hardware, graphics processing units (GPUs),
has been optimized to solve. Because it has gotten so much
attention at every level, there are a lot of tricks available to
Numpy for speeding up matrix multiplications, and because
that is an important measure of success, Numpy uses all of
them.
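For reference, this is what the operation looks like in Numpy. The array sizes are arbitrary; the point is that a single line hands the whole job to those heavily optimized routines:

import numpy as np

# Two arbitrary two-dimensional arrays
A = np.random.rand(500, 500)
B = np.random.rand(500, 500)

# Numpy dispatches this to optimized, compiled linear algebra
# code behind the scenes.
C = A @ B  # equivalent to np.matmul(A, B)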
But wait! The Numba team has helped us out even here.
There is another argument we can pass with the jitting
decorator.
@njit(parallel=True)
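A minimal sketch of how that can be used with the add function from before. Numba's prange replaces range to mark a loop whose iterations are safe to split across cores:

from numba import njit, prange

@njit(parallel=True)
def add(A, B, C):
    # prange tells Numba these iterations can run in parallel.
    for i in prange(A.size):
        C[i] = A[i] + B[i]
    return C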
This rounds out our list of rules. They are definitely not firm
or important or inviolable enough to be considered laws or
commandments, so I present:
@njit
def body_interactions_numba(
    f_x_a, f_x_b, f_y_a, f_y_b,
    k_a, k_b, r_a, r_b,
    x_a, x_b, y_a, y_b,
    v_x_a, v_x_b, v_y_a, v_y_b,
    sliding_friction, inelasticity,
):
    epsilon = 1e-12

    # ... (the pairwise force calculations are omitted here) ...

    # For each interacting pair (i_row, j_col), the contact forces
    # are accumulated into the force arrays for both bodies.
    f_x_a[i_row] += f_x_ab_contact
    f_x_b[j_col] -= f_x_ab_contact
    f_y_a[i_row] += f_y_ab_contact
    f_y_b[j_col] -= f_y_ab_contact
Monitoring
Our cardinal rule of Try It and Test It isn't limited to when
we first write the code. Interesting programs change over
time as they run. In the case of our simulation, some
configurations are more computationally demanding than
others, when bodies are in close proximity for example. Any
code that makes use of a stream of data will be encountering
new and unforeseen states on a regular basis. Any program
that accepts human input or feedback signals, like human-directed
reinforcement learning (to choose an example
completely at random), has the additional complexity of
dealing with an entirely unpredictable, and sometimes
mischievous or adversarial, human being on the other end.
A program that behaves well under development and
testing conditions may run into difficulties later when
running for real, also known as running in the wild or in
production.
Each time step either finishes on time or it doesn't, and so we wind up with 1000 yeses and nos per second.
That's more than we can easily digest, and it raises the
question of how to communicate that volume of information
to a human with limited reaction time and attention span.
It's helpful to play the What if? game when deciding what to
show and how to show it. What if a single clock cycle ran too
long? Would I try to speed up my code? Modify the
simulation? Or would I write it off as an anomaly? There are
a lot of things that could cause a single cycle to go over time,
and most of these have little bearing on the performance of
the simulation, so the right answer is to do nothing. A single
time step's violation is inconsequential, and we wouldn't
want to do anything in response if it occurs.
There are several options open to us, and they each tell a
different story. The most popular aggregation method is
averaging. Averages are efficient to compute and have a nice
intuitive interpretation. It's hard to go too far wrong with an
average. However, for the question we are trying to answer,
it's worth thinking through a particular case: if the overtime
deviation were high for a quarter of a second but zero for the
other three-quarters, how would we want to represent that?
If we represent it with the average, then it will appear that
the deviation was consistently low for the entire second.
We're missing out on some of the information that we care
about. A brief period of high overtime deviation is of
interest. Its appearance suggests that perhaps there's a
problem that needs addressing.
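To make that concrete, here is an illustrative second of made-up data, high for the first quarter second and zero for the rest:

import numpy as np

# One second of per-step overtime deviations at 1000 steps per second
overtime = np.zeros(1000)
overtime[:250] = 5.0  # milliseconds over budget, a made-up value

# The average smears the burst across the whole second,
# making the deviation look consistently low.
print(overtime.mean())  # 1.25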
are grouped. Similarly, the 90th percentile only tells you the
point above which sit 10% of your measurements. To get a
richer sense of the distribution, it would be useful to look at
a whole set of statistics. One common way is to look at the
10th percentile, the 20th, etc. up to the 90th, that is, to look at
each decile. This is a helpful way to get a sense of the overall
shape of the distribution.
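With Numpy this is a short computation. Continuing with the made-up overtime array from above:

# The 10th through 90th percentiles, one decile at a time
deciles = np.percentile(overtime, np.arange(10, 100, 10))
print(deciles)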
mngr = plt.get_current_fig_manager()
mngr.window.setGeometry(x_left, y_top, width, height)
where x_left is the distance in pixels from the left edge of the
screen to the left edge of the window, y_top is the distance in
pixels from the top of the screen to the top edge of the
window, width is the horizontal extent of the window in
pixels, and height is the vertical extent of the window in
pixels.
With the window size and location hard-coded, we can focus
on its contents. All we really need here is a single line
showing the recent history of overtime p90 values. Plotting a
line is the most basic of tasks in Matplotlib. It could be done
in just one line of code, so the fact that I chose to do it in 55
lines deserves some explanation.
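For contrast, a bare-bones version might look something like the sketch below. The names are illustrative, and it skips all of the niceties the longer version adds:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
line, = ax.plot([], [])
ax.set_xlabel("seconds")
ax.set_ylabel("overtime p90 (ms)")

history = []
for second in range(60):
    p90 = np.random.rand()  # stand-in for a real measurement
    history.append(p90)
    line.set_data(np.arange(len(history)), history)
    ax.relim()
    ax.autoscale_view()
    plt.pause(1.0)  # give the GUI a moment to redraw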
The Dashboard
A regularly updated performance plot of this nature is a
common tool for monitoring. It can be a lot more intuitive
than printing numbers in the terminal and looks better in a
PowerPoint presentation. It's typical to see a handful of these
plots in one window, in which case it's referred to as a
dashboard, evoking the collection of dials and status lights
on the dashboard of a car.
All of these are entirely feasible to do. We have the tools and
the computation budget to pull it off. So why wouldn't we?
There are some hidden costs here that are easy to ignore
until too late. The most fundamental is that not all
information is equally useful. We adhered to one guiding
principle when designing this plot: clearly answering a
question. What do we need to know in order to decide when
to take corrective action? How would we know when
computation was becoming a big enough bottleneck that we
needed to do something about it? The focus on action, on
them all, then shut itself down. But that doesn't handle the
case where the runner itself runs into trouble and gets shut
down or crashes first. I even tried adding another process
whose sole function was to watch the runner, and if it had
difficulties, close the runner down, then all of the child
processes, then itself. That kind of worked, but there were
still odd corner cases where it didn't catch everything. Also, I
was unsatisfied with the extra overhead and system
complexity I needed to add.