Why numpy is So Great
July 5, 2018 - Python
I've not been subtle about my love for numpy. I dedicated an entire blog post to how cool its arrays are while talking about Python Data Structures. However, I'd like to once again shout my love for numpy into the ether. Today's topic: vectorization of operations.

Setting up the Experiment
To frame this discussion, let's look at a single task: calculating the distance between one vector and a bunch of other vectors in a two-dimensional space. This sounds a bit silly, but it's a common machine learning task, so it's actually a pretty decent benchmark. Let's start by generating some data, specifically one new vector and an array full of 100 vectors to compare to.

import numpy as np

new_vec = np.array([1, 2])
all_vecs = np.random.uniform(size=(100, 2))
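Just to be concrete about what that gives us (a quick check I'm adding, not part of the original post): np.random.uniform defaults to the range [0, 1), so all_vecs is a 100-by-2 array of random points in the unit square.

print(all_vecs.shape)  # (100, 2)
print(all_vecs[0])     # something like [0.41 0.73], random values in [0, 1)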
To calculate this normally, we would want to use the Euclidean distance formula

$$ d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $$

In code form (without using arrays), we could do something like

x = all_vecs[0]
np.sqrt((new_vec[0] - x[0])**2 + (new_vec[1] - x[1])**2)
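As an aside (my addition, not from the original post), if you want to stay entirely in the standard library for a single distance, math.hypot computes exactly this square-root-of-summed-squares:

import math

x = all_vecs[0]
dist = math.hypot(new_vec[0] - x[0], new_vec[1] - x[1])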
So having done that, we now have the distance between our new vector new_vec and the first of our all_vecs comparison vectors. To scale this to all 100 vectors, we'd need to tuck that bit of code into a for loop. Let's take a look at that:

output = []
for x in all_vecs:
    dist = np.sqrt((new_vec[0] - x[0])**2 + (new_vec[1] - x[1])**2)
    output.append(dist)
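For what it's worth (my aside, not the post's), the loop-and-append pattern can be compressed into a list comprehension, which is a touch more idiomatic Python and computes exactly the same thing:

output = [np.sqrt((new_vec[0] - x[0])**2 + (new_vec[1] - x[1])**2) for x in all_vecs]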
That's all we need to do the job if we don't want to use arrays. For any number of vectors in all_vecs, we can loop through and calculate the distance. It's not an elegant solution, but it does the job. Now let's see how we'd do the same thing in numpy. To begin with, let's remember that numpy does broadcasting. For example:

a = np.array([1, 2])
b = np.array([[1, 2], [3, 4], [5, 6]])
print(a + b)
> [[2 4]
>  [4 6]
>  [6 8]]
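If it helps to see why that works, here's a quick shape check (my addition): broadcasting stretches the length-2 vector a across each of the three rows of b before adding.

print(a.shape)        # (2,)
print(b.shape)        # (3, 2)
print((a + b).shape)  # (3, 2), a was added to every row of b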
Note that numpy already realizes that these are vectors and does the addition row-by-row. We can use that to do the vector subtraction; then all we need to do is square the differences, sum each row, and take the square root. Let's see that in action:

output = np.sqrt(np.sum((all_vecs - new_vec)**2, axis=1))
The only fancy thing there is the axis=1, which tells numpy to sum across each row (the x and y components) rather than down each column. That's it though; numpy handles all the operations under the hood otherwise. Woo!
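As a side note (my addition, not something the original post uses), numpy also ships a helper that collapses the whole squared-sum-square-root pipeline into one call; np.linalg.norm with an axis argument should give identical results:

output = np.linalg.norm(all_vecs - new_vec, axis=1)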
Testing the Speed

So now that we know how to do this both ways (regular Python vs numpy), let's see how fast each method is. In particular, I'm interested in how this scales with data size, so we'll generate larger and larger versions of all_vecs, starting with just 10 vectors and ending with 1,000,000 vectors. I'm going to use the time module to track how long each operation takes. Here's the code to do this with regular Python.

import time

list_timing = []
for num_vecs in np.linspace(10, 1000000, 200):
    all_vecs = np.random.uniform(size=(int(num_vecs), 2)).tolist()
    # To measure when we start our process
    start = time.time()
    # The actual process
    output = []
    for x in all_vecs:
        dist = np.sqrt((new_vec[0] - x[0])**2 + (new_vec[1] - x[1])**2)
        output.append(dist)
    # Figure out when the process was done and
    # keep track of how long it took
    end = time.time()
    list_timing.append((num_vecs, end - start))
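One methodological aside (mine, not the post's): time.time() is fine for rough comparisons like this, but the standard library's time.perf_counter() uses a higher-resolution clock and is the usual recommendation for timing code. A minimal sketch of the swap, reusing new_vec and the all_vecs list from above:

import time

start = time.perf_counter()
output = []
for x in all_vecs:
    output.append(np.sqrt((new_vec[0] - x[0])**2 + (new_vec[1] - x[1])**2))
print(f"list version took {time.perf_counter() - start:.4f} s")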
Now the equivalent code, but in numpy:

array_timing = []
for num_vecs in np.linspace(10, 1000000, 200):
    all_vecs = np.random.uniform(size=(int(num_vecs), 2))
    # To measure when we start our process
    start = time.time()
    # The actual process
    output = np.sqrt(np.sum((all_vecs - new_vec)**2, axis=1))
    # Figure out when the process was done and
    # keep track of how long it took
    end = time.time()
    array_timing.append((num_vecs, end - start))
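Before trusting the timings, it's worth confirming that the two methods actually agree. Here's a small sanity check I'm adding (not in the original post), run with all_vecs as a numpy array:

list_output = [np.sqrt((new_vec[0] - x[0])**2 + (new_vec[1] - x[1])**2) for x in all_vecs]
array_output = np.sqrt(np.sum((all_vecs - new_vec)**2, axis=1))
print(np.allclose(list_output, array_output))  # should print True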
Now let's plot the outputs on a graph together to see how they scale. (The numpy version is called "Arrays" on the graph.)
That's just wow. We can see that arrays are WAY faster (lower time is faster) than the list method. Note that some of those spikes are pretty large, but they're still within the expected fluctuations from computer hardware, so the overall trend is what's important. It's hard to see the actual behavior of the array version though, so let's look at the same plot, but with the y-axis set to a logarithmic scale.
numpy is legendary. I love it so much. In this case, you can see that it still scales with the data in roughly the same manner as the Python method, just two orders of magnitude faster! That's a beautiful result that further demonstrates just how powerful numpy can be.

------
Code for making the plots (comment out the yscale line for the linear version):

import matplotlib.pyplot as plt

plt.figure(dpi=150)
list_X, list_times = zip(*list_timing)
array_X, array_times = zip(*array_timing)
plt.plot(list_X, list_times, 'r', label="Lists")
plt.plot(array_X, array_times, 'b', label="Arrays")
plt.yscale('log')
plt.xlabel("Number of Vectors")
plt.ylabel("Computation Time (s)")
plt.legend();
------