Speed Optimization Basics: Numba¶

When to use Numba¶

Numba works well when the code relies a lot on (1) numpy, (2) loops, and/or (2) cuda.
Hence, we would like to maximize the use of numba in our code where possible where there are loops/numpy

Numba CPU: nopython¶

For a basic numba application, we can cecorate python function thus allowing it to run without python interpreter
Essentially, it will compile the function with specific arguments once into machine code, then uses the cache subsequently

With Numba: no python¶

from numba import jit, prange
import numpy as np

# Numpy array of 10k elements
input_ndarray = np.random.rand(10000).reshape(10000)

# This is the only extra line of code you need to add
# which is a decorator
@jit(nopython=True)
def go_fast(a):
    trace = 0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i])
    return a + trace

%timeit go_fast(input_ndarray)

161 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without numba¶

This is much slower, time measured in the millisecond space rather than microsecond with @jit(nopython=True) or @njit

# Without numba: notice how this is really slow
def go_normal(a):
    trace = 0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i])
    return a + trace

%timeit go_normal(input_ndarray)

10.5 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numba CPU: parallel¶

Here, instead of the normal range() function we would use for loops, we would need to use prange() which allows us to execute the loops in parallel on separate threads
As you can see, it's slightly faster than @jit(nopython=True)

@jit(nopython=True, parallel=True)
def go_even_faster(a):
    trace = 0
    for i in prange(a.shape[0]):
        trace += np.tanh(a[i])
    return a + trace

%timeit go_even_faster(input_ndarray)

148 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numba CPU: fastmath¶

What if we relax our condition of strictly adhering to IEEE 754.
We can have faster performance (depends)
I would say this is the least additional speed-up unless you really dig into areas where fastmath=True thrives

@jit(nopython=True, parallel=True, fastmath=True)
def go_super_fast(a):
    trace = 0
    for i in prange(a.shape[0]):
        trace += np.tanh(a[i])
    return a + trace

%timeit go_super_fast(input_ndarray)

113 µs ± 39.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary¶

When to use Numba
- (1) numpy array or torch tensors,
- (2) loops, and/or
- (3) cuda
Numba CPU: nopython¶
Numba CPU: parallel
Numba CPU: fastmath

Speed Optimization Basics: Numba¶

When to use Numba¶

Numba CPU: nopython¶

With Numba: no python¶

Without numba¶

Numba CPU: parallel¶

Numba CPU: fastmath¶

Summary¶

Comments