Comparing NumPy efficiency vs Pure Python
Files associated with this lesson:
Lecture.ipynb
Comparing NumPy efficiency vs Pure Python¶
import sys
import numpy as np
Hands on!¶
As we have already told you, numpy will make your array-processing code more efficient. But the question is: how much more efficient? You might be surprised by some of the results; sometimes numpy optimizations speed up your code by 100x or even 1000x. For more information, watch this talk by Jake VanderPlas: Performance Python: Seven Strategies for Optimizing Your Numerical Code.
Size of objects¶
As we've discussed before, Python numbers are "boxed": an integer contains not only the actual value of the int, but also a lot of extra information about the object. So, for example, what would you expect the size of an integer to be in bytes? 2? 4? We can see the real value with this function:
sys.getsizeof(1)
28 bytes! That's a lot of memory for just one tiny int. With larger numbers, it gets worse:
sys.getsizeof(10**100)
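To get a feel for how quickly that grows, here's a minimal sketch that prints the size of increasingly large integers (exact byte counts vary across Python builds):
for exp in (1, 10, 100, 1000):
    # the size of a Python int grows with its magnitude
    print(f"10**{exp}: {sys.getsizeof(10 ** exp)} bytes")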
NumPy numbers are a lot more space-efficient: they're mapped closely to their C representation, they have a fixed size, and we can pick the right type for our particular use case.
For example, the default numpy int takes 8 bytes of memory:
np.dtype(int).itemsize  # np.int was removed in numpy 1.24; the builtin int maps to the default dtype
Python's built-in int (older numpy versions exposed it as the alias np.int, removed in numpy 1.24) maps to numpy's "default" integer type, which is either np.int64 or np.int32 depending on the platform; on this platform, it's np.int64:
np.dtype(int) == np.dtype(np.int64)
That's why it takes 8 bytes: 64 bits / 8 bits per byte = 8 bytes.
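If you want to verify that bits-to-bytes relationship for the other fixed-size integer types, here's a quick sketch:
for t in (np.int8, np.int16, np.int32, np.int64):
    # itemsize is just the bit width divided by 8
    print(t.__name__, np.dtype(t).itemsize, "bytes")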
Numpy also offers more granularity when picking the type of our arrays. For example, you can create an "unsigned 8-bit int" that takes just 1 byte of memory:
np.dtype(np.uint8).itemsize
That means that if you're dealing with small numbers (0-255), you can save a lot of memory. As a reminder, to set the type of an array, just pass the dtype argument:
np.array([0, 255], dtype=np.uint8)
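To make the savings concrete, here's a minimal comparison (the variable names are just illustrative) of a million values stored as 8-byte ints versus uint8, using the nbytes attribute, which reports the size of the array's data buffer:
values = np.random.randint(0, 255, size=1_000_000, dtype=np.int64)  # 8 bytes per element
small = values.astype(np.uint8)                                     # 1 byte per element
print(values.nbytes, small.nbytes)  # 8_000_000 vs 1_000_000 bytes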
You have to be aware of the "limits" of that type. For example, if we exceed the 255 limit, the value wraps around and we get back 0 (note that numpy ≥ 2.0 raises an OverflowError instead of silently wrapping an out-of-range Python integer):
np.array([0, 256], dtype=np.uint8)  # 256 wraps to 0 on older numpy; OverflowError on numpy >= 2.0
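The wraparound is easier to see with arithmetic on an existing uint8 array, which wraps modulo 256 silently regardless of the numpy version; a minimal sketch:
np.array([250, 251], dtype=np.uint8) + 10  # wraps modulo 256: array([4, 5], dtype=uint8)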
For a complete reference of numpy data types, check this document.
A note about performance and efficiency¶
We'll now compare the same operation in pure Python and in numpy to see the real performance impact. What we want to do is sum the squares of the elements of an array, which in pure Python looks like:
a = np.random.randint(1, 999, size=1_000_000)
sum([x ** 2 for x in a])
The same operation with numpy looks like this:
np.sum(a ** 2)
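As a side note, the same sum of squares can also be written as the dot product of the array with itself, an optional alternative that skips materializing the intermediate a ** 2 array:
a.dot(a)  # sum of squares via a dot product; same result as np.sum(a ** 2)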
How long does the operation take? We can use the %time magic command to get an idea of the elapsed time:
%time sum([x ** 2 for x in a])
At the time of this writing, it takes about 205 ms. On its own, that number doesn't tell us much, so let's compare it with numpy's version:
%time np.sum(a ** 2)
Numpy's version takes about 2.7 ms, which is roughly 100 times faster than the pure Python version.
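Since %time measures a single run, which can be noisy, a more robust comparison uses the standard timeit module to average over several runs; here's a minimal sketch (the number of runs is arbitrary):
import timeit

runs = 10
py_time = timeit.timeit(lambda: sum([x ** 2 for x in a]), number=runs)
np_time = timeit.timeit(lambda: np.sum(a ** 2), number=runs)
print(f"pure Python: {py_time / runs:.4f} s per run")
print(f"numpy:       {np_time / runs:.4f} s per run")
print(f"speedup:     ~{py_time / np_time:.0f}x")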
Implications when coding numpy¶
In order for numpy to make these operations as efficient as possible, arrays are allocated in contiguous positions of memory. That means you can't change the type or the length of an array without re-allocating the entire array in another memory position:
arr = np.array([1, 2, 3])
arr.dtype
Let's try assigning a float:
arr[1] = 3.5
The array's dtype hasn't changed: the decimal part was dropped and only the integer part was stored:
arr
Moreover, some operations will fail because the types are incompatible:
arr += .5  # raises a TypeError: the float64 result can't be cast back into the int64 array
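If you actually need floating-point values, the idiomatic fix is to create a new array with a float dtype, which allocates a fresh memory block; for example with astype (the variable name is just illustrative):
float_arr = arr.astype(np.float64)  # copies the data into a new float64 array
float_arr += .5
float_arr  # array([1.5, 3.5, 3.5])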