Vectorization, NumPy Universal Functions

In order to write performant code using numerical libraries, it is useful to keep the following rule in mind:

Code that is predictable can be made fast

The most common example of predictable code is a fixed-length loop that performs the same operation at each iteration (up to changes in the loop index)

Examples:

  1. Matrix-vector multiplication

  2. Functions applied element-wise to an array

Non-examples:

  1. Code with branch instructions (if, else, etc.)

  2. Code with recursive function calls (at least in Python)

One reason why predictable code can be fast is that most CPUs have what is called a branch predictor, which guesses which branch will be taken and speculatively begins executing it. If a branch is predicted incorrectly, the CPU has to discard that work and go along the correct branch, which takes time. Code without branches minimizes the number of branch prediction errors, speeding up code.

Another reason why predictable code can be made fast is vectorization. If you are performing the same operation on many elements in a predictable way, the compiled code can employ special SIMD instructions such as AVX, which operate on multiple data elements at once and can greatly increase efficiency.

You don’t need to worry about the details in Python, but it is good to know how to write code that allows libraries like NumPy to take advantage of these techniques. Note that standard Python loops will not benefit from these optimizations - you typically need to use libraries.
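As a small illustration (a sketch, not a benchmark - clip_loop and clip_where are names made up for this example), here is a branchy Python loop next to a branch-free NumPy version of the same operation; the NumPy version gives the library room to use vectorized instructions:

import numpy as np

# branchy: an if/else executed at every iteration of a Python loop
def clip_loop(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:
            out[i] = x[i]
        else:
            out[i] = 0.0
    return out

# branch-free: np.where computes the same result element-wise
def clip_where(x):
    return np.where(x > 0, x, 0.0)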

Universal Functions

NumPy universal functions (or ufuncs) are functions that are applied element-wise to an array. Examples include most math operations and logical comparisons. You can find additional information in the ufunc documentation.

import numpy as np
import time
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
def naive_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
        
    return z
start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))
time for naive add: 3.833e-01 sec
time for numpy add: 3.425e-03 sec

If you prefer to write your code without the for-loop, you can use np.vectorize. See the documentation for details.

@np.vectorize
def numpy_add_vectorize(x, y):
    return x + y

start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numpy_add_vectorize(x, y)
end = time.time()
print("time for numpy add (vectorize): {:0.3e} sec".format(end - start))
time for naive add: 4.000e-01 sec
time for numpy add: 2.246e-03 sec
time for numpy add (vectorize): 1.943e-01 sec

np.vectorize doesn’t really give a performance boost, but it can make defining functions simpler.
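For example, np.vectorize lets you write a scalar function with ordinary branching and still apply it to whole arrays (a sketch; relu is a name made up for this example):

@np.vectorize
def relu(x):
    # ordinary scalar code with a branch
    if x > 0:
        return x
    return 0.0

relu(np.array([-1.0, 0.5, 2.0]))  # array([0. , 0.5, 2. ])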

Exercise

  1. Perform some timing tests that compare a naive Python loop implementation with a NumPy ufunc. Try computing sin, exp, multiplication, etc. (one possible starting point is sketched below)

## Your code here
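One possible starting point (a sketch using the x defined above; naive_sin is a made-up name):

def naive_sin(x):
    """
    apply np.sin element-by-element in a Python loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = np.sin(x[i])
    return z

start = time.time()
z1 = naive_sin(x)
end = time.time()
print("time for naive sin: {:0.3e} sec".format(end - start))

start = time.time()
z2 = np.sin(x)
end = time.time()
print("time for numpy sin: {:0.3e} sec".format(end - start))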

More complicated functions

Some functions that can be sped up considerably are a bit more complicated than ufuncs. One example is matrix-vector multiplication. We’ll use the notation Ax = y, where

\begin{equation}
y_i = \sum_j A_{i,j} x_j
\end{equation}

# set up matrix and vector for multiplication
m, n = 500, 1000
A = np.random.randn(m, n)
x = np.random.randn(n)
def naive_matvec(A, x):
    """
    naive matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    
    return y
start = time.time()
y1 = naive_matvec(A, x)
end = time.time()
print("time for naive matvec: {:0.3e} sec".format(end - start))

start = time.time()
# y2 = np.matmul(A, x)
y2 = A @ x
end = time.time()
print("time for numpy matvec: {:0.3e} sec".format(end - start))

np.linalg.norm(y1 - y2)
time for naive matvec: 3.146e-01 sec
time for numpy matvec: 1.425e-02 sec
7.337307652672486e-13

Numba

Numba is a just-in-time (JIT) compiler for Python code. It provides several decorators that make it very easy to get speedups for numerical code in many situations.

Just-in-time compilation is an increasingly popular approach that bridges the gap between interpreted and compiled languages. Generally:

  • Interpreted languages (such as Python) read code line by line and execute it as they go.

  • Compiled languages (such as C, C++, Fortran) compile code ahead of time into a binary, which can be optimized to run quickly.

Compilation takes time initially but saves time when you use the binary. Python libraries such as NumPy and SciPy use compiled libraries under the hood for speed. Interpreted languages tend to be slower, but are easier to develop in.

Just-in-time compilation produces a compiled version of a function the first time it is needed. Julia is a relatively new language that uses JIT to produce fast code with less development overhead.

One of the things a compiler needs to know is the type of every value - if you want to use different types (e.g. single- and double-precision versions of a function), you need different compiled versions. Python usually lets you avoid worrying too much about types, but this is one reason why you need to be aware of them anyway in scientific computing.

First:

$ conda install numba
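If you are not using conda, numba can also be installed with pip:

$ pip install numba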

Let’s look at how Numba can be used with our naive_add function.

from numba import jit, njit

@jit # this decorator is the only thing we change
def numba_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
        
    return z
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)

start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 3.947e-01 sec
time for numpy add: 2.283e-03 sec
time for numba add: 1.935e-01 sec

The numba JIT function runs in about the same time as the naive function. Let’s see what happens when we run the code again:

# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)

start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 3.745e-01 sec
time for numpy add: 2.261e-03 sec
time for numba add: 3.742e-03 sec

Now the numba function is much faster. This is because the first time the function is called, it must be compiled. Every subsequent time you call the function, it will run much faster.

The take-away is that it is advantageous to use JIT with functions you will use repeatedly, but not necessarily worth the time for functions you will only use once.
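You can see the per-type compilation described above by inspecting the jitted function. This is a sketch; it assumes the signatures attribute that numba's dispatcher exposes:

# calling with a new input type triggers a fresh compilation
z32 = numba_add(x.astype(np.float32), y.astype(np.float32))

# one compiled signature per input type seen so far
print(numba_add.signatures)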

Advanced numba

You can get a lot of mileage out of numba without too much trouble. It is always good to look at the documentation to learn more. Here are a few examples:

Parallelization (this is supported for a handful of known operations).

from numba import prange # parallel range

@jit(nopython=True, parallel=True)
def numba_add_parallel(x, y):
    """
    add two arrays in parallel using numba's prange
    """
    z = np.empty_like(x)
    for i in prange(len(x)):
        z[i] = x[i] + y[i]
        
    return z
# set up two vectors
n = 100_000_000
x = np.random.randn(n)
y = np.random.randn(n)

z = numba_add_parallel(x, y) # precompile

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add_parallel(x, y)
end = time.time()
print("time for numba parallel add: {:0.3e} sec".format(end - start))
time for numba add: 1.033e+00 sec
time for numpy add: 1.108e+00 sec
time for numba parallel add: 2.930e-01 sec
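The thread count used by parallel regions can be inspected and changed. This is a sketch assuming numba's get_num_threads / set_num_threads API:

import numba

print(numba.get_num_threads())  # threads currently used by parallel regions
numba.set_num_threads(2)        # restrict parallel regions to 2 threads
z = numba_add_parallel(x, y)    # now runs on 2 threads
numba.set_num_threads(numba.config.NUMBA_NUM_THREADS)  # restore the default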

Parallelization of matvec:

@jit(nopython=True, parallel=True)
def numba_matvec(A, x):
    """
    parallelized matrix-vector multiplication using prange over rows
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in prange(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    
    return y
# set up matrix and vector for multiplication
m, n = 4000, 4000
A = np.random.randn(m, n)
x = np.random.randn(n)

y = numba_matvec(A, x) # precompile

start = time.time()
y1 = numba_matvec(A, x)
end = time.time()
print("time for numba parallel matvec: {:0.3e} sec".format(end - start))

start = time.time()
y2 = np.matmul(A, x)
end = time.time()
print("time for numpy matvec: {:0.3e} sec".format(end - start))

np.linalg.norm(y1 - y2)
time for numba parallel matvec: 9.608e-03 sec
time for numpy matvec: 1.259e-02 sec
8.82499895126857e-12

Declaring ufuncs without a loop. You can define a function on elements using numba.vectorize. Unlike numpy.vectorize, numba will give you a noticeable speedup. We need to provide a call signature - for example, float32(float32, float32) takes two single-precision floating point numbers (float32) as input and returns a single-precision floating point number as output. See the documentation for additional details.

from numba import vectorize

@vectorize(['float32(float32, float32)']) # call signature defined
def numba_add_vectorize(a, b):
    return a + b
# Initialize arrays
n = 100_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(x.shape, dtype=x.dtype)
z = np.empty_like(x, dtype=x.dtype)

# precompile
z = numba_add_vectorize(x, y)

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add_vectorize(x, y)
end = time.time()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
time for numba add: 3.935e-01 sec
time for numba add (vectorized): 1.437e-01 sec
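numba.vectorize also accepts a target argument; target='parallel' compiles a multithreaded version of the ufunc. A sketch (numba_add_parallel_vec is a made-up name):

@vectorize(['float32(float32, float32)'], target='parallel')
def numba_add_parallel_vec(a, b):
    return a + b

z = numba_add_parallel_vec(x, y)  # first call compiles; later calls run in parallel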

For NVIDIA GPUs, you can use CUDA through numba - see the NVIDIA page for an example.

from numba import vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def numba_add_cuda(a, b):
    return a + b
# precompile
z = numba_add_cuda(x, y)

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add_vectorize(x, y)
end = time.time()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add_cuda(x, y)
end = time.time()
print("time for numba add (cuda): {:0.3e} sec".format(end - start))
---------------------------------------------------------------------------
CudaSupportError                          Traceback (most recent call last)
<ipython-input-20-eb0c68621d6d> in <module>
      1 # precompile
----> 2 z = numba_add_cuda(x, y)

CudaSupportError: Error at driver init:
[999] Call to cuInit results in CUDA_ERROR_UNKNOWN:

On the machine used for these notes, the call fails at driver initialization because no usable CUDA GPU is available. When a GPU is present, whether it gives a speedup will depend on the type of operation being performed, and on whether the GPU is better suited for the computation than vectorized instructions on a CPU.
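If you want code that runs whether or not a GPU is present, you can check for CUDA availability first. A sketch using numba's cuda module:

from numba import cuda

if cuda.is_available():
    z = numba_add_cuda(x, y)
else:
    print("no usable CUDA GPU detected")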

For more information, see numba's performance hints.