Vectorization, Numpy Universal Functions¶
In order to write performant code using numerical libraries, it is useful to keep the following rule in mind:
Code that is predictable can be made fast
The most common example of predictable code is a fixed-length loop that performs the same operation at each iteration (up to changes in index).
Examples:

- Matrix-vector multiplication
- Functions applied element-wise to an array

Non-examples:

- Code with branch instructions (if, else, etc.)
- Code with recursive function calls (at least in Python)
One reason why predictable code can be fast is that most CPUs have what is called a branch predictor, which pre-loads computation along the branch it expects to be taken. If a branch is predicted incorrectly, then the CPU has to switch gears and go along the correct branch, which takes time. Code without branches will minimize the number of branch prediction errors, speeding up code.
Another reason why predictable code can be made fast is vectorization. If you are performing the same operation in a predictable way, the code can employ special SIMD instructions such as AVX, which can greatly increase efficiency.
You don’t need to worry about the hardware details in Python, but it is good to know how to write code that allows libraries like NumPy to take advantage of these techniques. Note that standard Python loops will not take advantage of them - you typically need to use libraries.
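As a small illustration of the difference (a minimal sketch with hypothetical helper names, separate from the timing experiments below), a branchy Python loop can often be replaced by a branch-free, vectorized NumPy expression such as np.where:

import numpy as np

x = np.random.randn(1_000_000)

# branchy version: an if/else is evaluated at every iteration of a Python loop
def relu_loop(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:
            out[i] = x[i]
        else:
            out[i] = 0.0
    return out

# predictable, vectorized version: the branch is replaced by np.where
def relu_vectorized(x):
    return np.where(x > 0, x, 0.0)

# both produce the same result, but the vectorized version lets NumPy
# use branch-free, SIMD-friendly array operations under the hood
assert np.allclose(relu_loop(x), relu_vectorized(x))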
Universal Functions¶
Numpy universal functions (or ufuncs) are functions that are applied element-wise to an array. Examples include most math operations and logical comparisons. You can find additional information in the ufunc documentation.
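As a quick illustration (a minimal sketch, separate from the timing experiment below), ufuncs such as np.exp, np.maximum, and the comparison operators act element-wise on whole arrays:

import numpy as np

a = np.array([0.0, 1.0, 2.0])
b = np.array([2.0, 1.0, 0.0])

print(np.exp(a))         # element-wise exponential
print(a > b)             # element-wise comparison -> boolean array
print(np.maximum(a, b))  # element-wise maximum of the two arrays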
import numpy as np
import time
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
def naive_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z
start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))
time for naive add: 3.833e-01 sec
time for numpy add: 3.425e-03 sec
If you prefer to write your code without the for-loop, you can use np.vectorize. See the documentation for details.
@np.vectorize
def numpy_add_vectorize(x, y):
    return x + y
start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.time()
z = numpy_add_vectorize(x, y)
end = time.time()
print("time for numpy add (vectorize): {:0.3e} sec".format(end - start))
time for naive add: 4.000e-01 sec
time for numpy add: 2.246e-03 sec
time for numpy add (vectorize): 1.943e-01 sec
np.vectorize doesn’t really give a performance boost, but it can make defining functions simpler.
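For example (a sketch with a hypothetical function name), a scalar function containing an if statement can’t be applied directly to an array, because the truth value of an array is ambiguous; decorating it with np.vectorize makes it accept array arguments:

import numpy as np

@np.vectorize
def clipped_reciprocal(x):
    # a scalar function with a branch; without np.vectorize, passing an
    # array here would raise an error when the if is evaluated
    return 1.0 / x if x != 0 else 0.0

print(clipped_reciprocal(np.array([2.0, 0.0, 4.0])))  # roughly [0.5  0.  0.25]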
Exercise¶
Perform some timing tests that compare a naive Python loop implementation with a NumPy ufunc. Try computing sin, exp, multiplication, etc.
## Your code here
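One possible sketch of such a timing test, using np.sin (the same pattern works for np.exp or element-wise multiplication):

import math
import time
import numpy as np

n = 1_000_000
x = np.random.randn(n)

# naive Python loop applying math.sin element by element
start = time.time()
z_loop = np.array([math.sin(xi) for xi in x])
end = time.time()
print("time for loop sin : {:0.3e} sec".format(end - start))

# NumPy ufunc applied to the whole array at once
start = time.time()
z_ufunc = np.sin(x)
end = time.time()
print("time for numpy sin: {:0.3e} sec".format(end - start))

# the two results should agree up to floating point error
print(np.linalg.norm(z_loop - z_ufunc))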
More complicated functions¶
Some functions that can be sped up considerably are a bit more complicated than ufuncs. One example is matrix-vector multiplication. We’ll use the notation $Ax = y$, where

\begin{equation}
y_i = \sum_j A_{i,j} x_j
\end{equation}
# set up matrix and vector for multiplication
m, n = 500, 1000
A = np.random.randn(m, n)
x = np.random.randn(n)
def naive_matvec(A, x):
    """
    naive matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    return y
start = time.time()
y1 = naive_matvec(A, x)
end = time.time()
print("time for naive matvec: {:0.3e} sec".format(end - start))
start = time.time()
# y2 = np.matmul(A, x)
y2 = A @ x
end = time.time()
print("time for numpy matvec: {:0.3e} sec".format(end - start))
np.linalg.norm(y1 - y2)
time for naive matvec: 3.146e-01 sec
time for numpy matvec: 1.425e-02 sec
7.337307652672486e-13
Numba¶
Numba is a just in time (JIT) compiler for Python code. It provides several decorators which make it very easy to get speedups for numerical code in many situations.
Just in time compilation is an increasingly popular solution that bridges the gap between interpreted and compiled languages. Generally:
- Interpreted languages (such as Python) simply read code line by line and execute it as they go along.
- Compiled languages (such as C, C++, Fortran) compile code into a binary, which can be optimized to run quickly.
Compilation takes time initially but saves time when you use the binary. Python libraries such as NumPy and SciPy use compiled libraries under the hood for speed. Interpreted languages tend to be slower, but are easier to develop in.
Just in time compilation will produce a compiled version of a function the first time it is needed. Julia is a relatively new language which uses JIT to produce fast code with less development overhead.
One of the things you need to know in order to compile code is the types used - if you want to use different types (e.g. single and double precision versions of a function), you need different compiled versions. Python usually allows you to not worry too much about type, but this is one reason why you need to know about it anyway in scientific computing.
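As a small illustration of why types matter (a sketch using plain NumPy, independent of Numba), arrays carry an explicit dtype, and single and double precision arrays are genuinely different as far as compiled code is concerned:

import numpy as np

x64 = np.random.randn(5)          # default dtype: double precision (float64)
x32 = x64.astype(np.float32)      # single precision copy of the same values

print(x64.dtype, x32.dtype)       # float64 float32
print((x64 + x32).dtype)          # mixing the two promotes the result to float64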
First:
$ conda install numba
Let’s look at how Numba can be used with our naive_add function.
from numba import jit, njit
@jit # this is the only thing we do differently
def numba_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 3.947e-01 sec
time for numpy add: 2.283e-03 sec
time for numba add: 1.935e-01 sec
The numba JIT function runs in about the same time as the naive function. Let’s see what happens when we run the code again:
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)
start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))
start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
time for naive add: 3.745e-01 sec
time for numpy add: 2.261e-03 sec
time for numba add: 3.742e-03 sec
Now the numba function is much faster. This is because the first time the function is called, it must be compiled. Every subsequent time you call the function, it will run much faster.
The take-away is that it is advantageous to use JIT with functions you will use repeatedly, but not necessarily worth the time for functions you will only use once.
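If you know the input types ahead of time, Numba can also compile eagerly: passing an explicit signature to the decorator moves the compilation cost to definition time instead of the first call. A minimal sketch (the function name is just for illustration; see the Numba docs for the signature syntax):

from numba import njit
import numpy as np

# "float64[:](float64[:], float64[:])" means: takes two 1-d float64 arrays
# and returns a 1-d float64 array; compilation happens when the decorator
# runs, not when the function is first called
@njit("float64[:](float64[:], float64[:])")
def numba_add_eager(x, y):
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
    return z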
Advanced numba¶
You can get a lot of mileage out of numba without too much trouble. It is always good to look at the documentation to learn more. Here are a few examples:
Parallelization (this is supported on a handful of known operations).
from numba import prange # parallel range
@jit(nopython=True, parallel=True)
def numba_add_parallel(x, y):
    """
    add two arrays using a parallel loop (prange)
    """
    z = np.empty_like(x)
    for i in prange(len(x)):
        z[i] = x[i] + y[i]
    return z
# set up two vectors
n = 100_000_000
x = np.random.randn(n)
y = np.random.randn(n)
z = numba_add_parallel(x, y) # precompile
start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))
start = time.time()
z = numba_add_parallel(x, y)
end = time.time()
print("time for numba parallel add: {:0.3e} sec".format(end - start))
time for numba add: 1.033e+00 sec
time for numpy add: 1.108e+00 sec
time for numba parallel add: 2.930e-01 sec
Parallelization of matvec:
@jit(nopython=True, parallel=True)
def numba_matvec(A, x):
    """
    matrix-vector multiplication parallelized over rows with prange
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in prange(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    return y
# set up matrix and vector for multiplication
m, n = 4000, 4000
A = np.random.randn(m, n)
x = np.random.randn(n)
y = numba_matvec(A, x) # precompile
start = time.time()
y1 = numba_matvec(A, x)
end = time.time()
print("time for numba parallel matvec: {:0.3e} sec".format(end - start))
start = time.time()
y2 = np.matmul(A, x)
end = time.time()
print("time for numpy matvec: {:0.3e} sec".format(end - start))
np.linalg.norm(y1 - y2)
time for numba parallel matvec: 9.608e-03 sec
time for numpy matvec: 1.259e-02 sec
8.82499895126857e-12
Declaring ufuncs without a loop. You can define a function on elements using numba.vectorize. Unlike numpy.vectorize, numba will give you a noticeable speedup. We need to provide a call signature - for example, float32(float32, float32) will take two single precision floating point numbers (float32) as input and return a single precision floating point number as output. See the documentation for additional details.
from numba import vectorize
@vectorize(['float32(float32, float32)']) # call signature defined
def numba_add_vectorize(a, b):
    return a + b
# Initialize arrays
n = 100_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(x.shape, dtype=x.dtype)
z = np.empty_like(x, dtype=x.dtype)
# precompile
z = numba_add_vectorize(x, y)
start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.time()
z = numba_add_vectorize(x, y)
end = time.time()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
time for numba add: 3.935e-01 sec
time for numba add (vectorized): 1.437e-01 sec
For NVIDIA GPUs, you can use CUDA through numba - see the NVIDIA page for an example.
from numba import vectorize
@vectorize(['float32(float32, float32)'], target='cuda')
def numba_add_cuda(a, b):
    return a + b
# precompile
z = numba_add_cuda(x, y)
start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))
start = time.time()
z = numba_add_vectorize(x, y)
end = time.time()
print("time for numba add (vectorized): {:0.3e} sec".format(end - start))
start = time.time()
z = numba_add_cuda(x, y)
end = time.time()
print("time for numba add (cuda): {:0.3e} sec".format(end - start))
---------------------------------------------------------------------------
CudaAPIError Traceback (most recent call last)
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py in initialize(self)
238 _logger.info('init')
--> 239 self.cuInit(0)
240 except CudaAPIError as e:
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py in safe_cuda_api_call(*args)
301 retcode = libfn(*args)
--> 302 self._check_error(fname, retcode)
303 return safe_cuda_api_call
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
336 raise CudaDriverError("CUDA initialized before forking")
--> 337 raise CudaAPIError(retcode, msg)
338
CudaAPIError: [999] Call to cuInit results in CUDA_ERROR_UNKNOWN
During handling of the above exception, another exception occurred:
CudaSupportError Traceback (most recent call last)
<ipython-input-20-eb0c68621d6d> in <module>
1 # precompile
----> 2 z = numba_add_cuda(x, y)
3
4 start = time.time()
5 z = numba_add(x, y)
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/dispatcher.py in __call__(self, *args, **kws)
39 the input arguments.
40 """
---> 41 return CUDAUFuncMechanism.call(self.functions, args, kws)
42
43 def reduce(self, arg, stream=0):
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/np/ufunc/deviceufunc.py in call(cls, typemap, args, kws)
289 any_device = True
290 else:
--> 291 dev_a = cr.to_device(a, stream=stream)
292 devarys.append(dev_a)
293
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/dispatcher.py in to_device(self, hostary, stream)
159
160 def to_device(self, hostary, stream):
--> 161 return cuda.to_device(hostary, stream=stream)
162
163 def to_host(self, devary, stream):
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py in _require_cuda_context(*args, **kws)
221 @functools.wraps(fn)
222 def _require_cuda_context(*args, **kws):
--> 223 with _runtime.ensure_context():
224 return fn(*args, **kws)
225
~/miniconda3/envs/pycourse/lib/python3.8/contextlib.py in __enter__(self)
111 del self.args, self.kwds, self.func
112 try:
--> 113 return next(self.gen)
114 except StopIteration:
115 raise RuntimeError("generator didn't yield") from None
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py in ensure_context(self)
119 any top-level Numba CUDA API.
120 """
--> 121 with driver.get_active_context():
122 oldctx = self._get_attached_context()
123 newctx = self.get_or_create_context(None)
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py in __enter__(self)
393 else:
394 hctx = drvapi.cu_context(0)
--> 395 driver.cuCtxGetCurrent(byref(hctx))
396 hctx = hctx if hctx.value else None
397
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py in __getattr__(self, fname)
280 # Initialize driver
281 if not self.is_initialized:
--> 282 self.initialize()
283
284 if self.initialization_error is not None:
~/miniconda3/envs/pycourse/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py in initialize(self)
240 except CudaAPIError as e:
241 self.initialization_error = e
--> 242 raise CudaSupportError("Error at driver init: \n%s:" % e)
243 else:
244 self.pid = _getpid()
CudaSupportError: Error at driver init:
[999] Call to cuInit results in CUDA_ERROR_UNKNOWN:
On a machine with a working CUDA driver (the cell above fails here because no GPU is available), you may see a bit of a speedup using the GPU - this will depend on the type of operation being performed, and on whether the GPU is better suited for the computation than a vectorized instruction on a CPU.
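If a GPU is not available (as in the failed cell above), numba.vectorize also accepts target='parallel', which compiles a multi-threaded CPU version of the same ufunc. A minimal sketch with a hypothetical function name, reusing the call signature from before:

from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32)'], target='parallel')
def numba_add_parallel_ufunc(a, b):
    return a + b

x = np.ones(1_000_000, dtype=np.float32)
y = np.ones_like(x)
z = numba_add_parallel_ufunc(x, y)  # runs across multiple CPU threads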
For more information, see performance hints.