BLAS and LAPACK

We’ve seen a bit of dense linear algebra using numpy and scipy. Now we’re going to look under the hood.

Regardless of what language you're using, chances are that if you're doing numerical linear algebra, you can take advantage of libraries which implement the most common linear algebra routines and factorizations.

Basic Linear Algebra Subprograms (BLAS)

Linear Algebra PACKage (LAPACK)

These libraries are wrapped by scipy, which exposes a Python interface to them.

Why should you care?

First, you probably shouldn’t be writing your own basic linear algebra routines if you can avoid it.

  1. It takes time to write them

  2. Even if you know what you’re doing, there’s a chance you have a bug

  3. Optimizing for performance takes significant additional effort and expertise

It is entirely possible to do linear algebra in Python without ever worrying about the libraries under the hood. However, maybe you are prototyping an algorithm in Python and then want to write compiled, optimized code in C or Fortran. In that case, it is good to be able to translate what you're doing into BLAS/LAPACK routines.

View Your Configuration

You can view which BLAS and LAPACK libraries NumPy is using:

%pylab inline
import scipy as sp
import scipy.linalg as la

np.__config__.show()
Populating the interactive namespace from numpy and matplotlib
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']

The output above shows the library mkl_rt, indicating that the system is using Intel's Math Kernel Library (MKL). MKL is a library of mathematical functions (including BLAS and LAPACK) which is optimized for Intel CPUs, and is the default for Anaconda Python.

You can do the same for scipy:

sp.__config__.show()
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/brad/miniconda3/envs/pycourse/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/brad/miniconda3/envs/pycourse/include']

BLAS Routines

scipy BLAS interface

BLAS implements basic linear algebra routines like dot product, matrix-vector product, and matrix-matrix product as well as triangular solves.

It is written in Fortran, so it will be easiest to use if you set the flag order='F' (column-major, Fortran-style memory layout) when constructing arrays.

from scipy.linalg import blas

n = 4096
A = np.array(np.random.randn(n,n), order='F')
B = np.array(np.random.randn(n,n), order='F')


C_np = A @ B # numpy  C = A * B
C_blas = blas.dgemm(1.0, A, B) # BLAS C = A * B
np.linalg.norm(C_np - C_blas)
6.536062998606052e-12
%timeit C_np = np.matmul(A,B) # numpy
1.18 s ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit C_blas = blas.dgemm(1.0, A, B) # blas
1.22 s ± 79.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
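
Since the wrappers expect Fortran-ordered (column-major) arrays, passing a C-ordered array still works, but scipy will typically make an internal Fortran-ordered copy first. Here is a quick check of this; it is a sketch that was not part of the original timings.

A_c = np.ascontiguousarray(A)       # C-ordered (row-major) copy of A
print(A.flags['F_CONTIGUOUS'], A_c.flags['F_CONTIGUOUS'])  # True False
C_f = blas.dgemm(1.0, A, B)         # Fortran-ordered input, used directly
C_c = blas.dgemm(1.0, A_c, B)       # C-ordered input, likely copied internally
print(np.linalg.norm(C_f - C_c))    # same result either way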

BLAS Naming Conventions

BLAS functions have short names (like dgemm) that look a little cryptic at first. There is a pattern to the names, though; see the reference documentation for the full list.

blas.<character><name><mod>

The <character> field can be

  1. s: single precision float

  2. d: double precision float

  3. c: single precision complex float

  4. z: double precision complex float

The <name> field indicates either the operation type or the matrix type.

The <mod> field indicates additional details about the operation.

For example, dgemm = d + ge + mm has d as the <character>, ge for <name> (general matrix), and mm for <mod> (matrix multiplication).
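
As a quick illustration (a sketch, not from the original notes), the same gemm operation is available with a different <character> prefix for each dtype, and scipy.linalg.get_blas_funcs can select the prefix for you:

# the same <name><mod> (gemm) with different <character> prefixes
A32 = np.random.randn(200, 200).astype(np.float32)   # single precision
A64 = np.random.randn(200, 200)                       # double precision
Az = A64 + 1j * np.random.randn(200, 200)             # double precision complex

C32 = blas.sgemm(1.0, A32, A32)   # s
C64 = blas.dgemm(1.0, A64, A64)   # d
Cz = blas.zgemm(1.0, Az, Az)      # z

# let scipy choose the prefix based on the array dtypes
gemm = la.get_blas_funcs('gemm', (A32, A32))
print(gemm.typecode)              # should print 's'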

BLAS Levels

BLAS is split up into 3 conceptual levels

  1. Level 1 - vector-vector operations

  2. Level 2 - matrix-vector operations

  3. Level 3 - matrix-matrix operations

Level 3 operations are the most efficient, since they can take the most advantage of memory performance optimizations, e.g. minimizing cache misses.

FLOPS

FLOPS are Floating-Point Operations Per Second, one way to measure the performance of numerical code and hardware. Floating point operations are counted as the number of +, -, *, / operations applied to floating point numbers.
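
For example, a dot product of two length-\(n\) vectors takes about \(2n\) floating point operations (one multiplication and one addition per entry), and multiplying two \(n \times n\) matrices takes about \(2n^3\) (each of the \(n^2\) output entries needs \(n\) multiplications and \(n\) additions). If an operation requiring \(W\) floating point operations runs in \(t\) seconds, the achieved rate is \(W / t\) FLOPS; this is the calculation used in the timing cells below.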

  • A human is capable of <1 flop

  • Your CPU is probably capable of several Giga-flops (\(10^9\))

  • A high end GPU will be measured in Tera-flops (\(10^{12}\))

  • Supercomputers are mostly O(100) Peta-flops (\(10^{17}\)); some are close to an Exa-flop (\(10^{18}\))

BLAS Level 1

Some vector-vector operations (insert the appropriate <character>)

  1. axpy (\(a x + y\))

  2. dot (dot product \(x^T y\))

  3. nrm2 (\(\|x\|_2\))

See the SciPy BLAS reference

n = 1000
x = np.random.rand(n)
y = np.random.rand(n)

z = blas.daxpy(x, y, a=1.0)
xx = blas.ddot(x, x)
x2 = blas.dnrm2(x)
np.sqrt(xx) - x2
3.552713678800501e-15
import time
# measure FLOPs of fdot
n = 10_000_000
dtype = np.float32
x = np.array(np.random.randn(n), dtype=dtype)
y = np.array(np.random.randn(n), dtype=dtype)
print("measuring BLAS Level 1 flops using {}".format(dtype))
t0 = time.time()
xy = blas.sdot(x,y)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2*n) / (t1 - t0) # 2N FLOP for dot - one multiplication and one addition per entry
print("{:e} FLOPS".format(flops))
measuring BLAS Level 1 flops using <class 'numpy.float32'>
time elapsed = 0.006452083587646484 sec.
3.099774e+09 FLOPS

This is consistent with a CPU running at ~3 GHz

BLAS Level 2

Some matrix-vector operations (insert appropriate <character>)

  1. gemv \(\alpha A x\)

  2. trsv \(L^{-1} x\) (triangular solve)

  3. trmv \(L x\) (triangular matrix-vector product)

See the SciPy BLAS reference

# measure FLOPs of gemv
n = 4096
dtype = np.float32
A = np.array(np.random.randn(n, n), dtype=dtype)
x = np.array(np.random.randn(n), dtype=dtype)
print("measuring BLAS Level 2 flops using {}".format(dtype))
t0 = time.time()
y = blas.sgemv(1.0, A, x)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2 * n**2) / (t1 - t0) # 2 n**2 FLOP - one multiplication and one addition per entry of A
print("{:e} FLOPS".format(flops))
measuring BLAS Level 2 flops using <class 'numpy.float32'>
time elapsed = 0.201141357421875 sec.
1.668202e+08 FLOPS
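
The trsv routine listed above is called in a similar way. Here is a small sketch of a triangular solve; it is not part of the original notes, and the lower= keyword follows scipy's low-level BLAS wrapper.

# solve L_tri @ x = b_vec with a lower triangular matrix
n_t = 1000
L_tri = np.array(np.tril(np.random.rand(n_t, n_t)) + n_t * np.eye(n_t), order='F')
b_vec = np.random.rand(n_t)
x_sol = blas.dtrsv(L_tri, b_vec, lower=1)     # triangular solve
print(np.linalg.norm(L_tri @ x_sol - b_vec))  # residual should be tiny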

BLAS Level 3

Some matrix-matrix operations (insert appropriate <character>)

  1. gemm \(\alpha AB\)

  2. syrk \(\alpha A A^T\)

  3. trmm \(\alpha L_1 L_2\) (triangular multiplication)

See the SciPy BLAS reference

# measure FLOPs of gemm
n = 4096
dtype = np.float32
A = np.array(np.random.randn(n, n), dtype=dtype)
B = np.array(np.random.randn(n, n), dtype=dtype)
print("measuring BLAS Level 3 flops using {}".format(dtype))
t0 = time.time()
C = blas.sgemm(1.0, A, B)
t1 = time.time()
print("time elapsed = {} sec.".format(t1 - t0))
flops = (2 * n**3) / (t1 - t0) # 2 n**3 FLOP - n multiplications and additions per entry of C
print("{:e} FLOPS".format(flops))
measuring BLAS Level 3 flops using <class 'numpy.float32'>
time elapsed = 1.088022232055664 sec.
1.263200e+11 FLOPS

From our experiments, we see that the BLAS Level 3 operation gives us the most FLOPS.
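
The other Level 3 routines listed above are called in the same way. For example, here is a small sketch (not part of the original notes) using syrk, which computes \(\alpha A A^T\) and only fills in one triangle of the (symmetric) result:

A_s = np.array(np.random.randn(1000, 1000), order='F')
C_syrk = blas.dsyrk(1.0, A_s)          # upper triangle of A_s @ A_s.T by default
C_full = A_s @ A_s.T
print(np.linalg.norm(np.triu(C_syrk) - np.triu(C_full)))  # compare the computed triangles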

LAPACK Routines

scipy LAPACK interface

from scipy.linalg import lapack

LAPACK is a library of linear algebra routines that go beyond basic operations. These include routines for various factorizations and eigenvalue and singular value decompositions.

Again, the names are a bit cryptic, and it is worth searching online (and reading documentation) to figure out how to call the right functions.
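
For example (a small sketch, not part of the original notes), the Cholesky factorization of a symmetric positive definite matrix is exposed as potrf:

# Cholesky factorization S = L_chol @ L_chol.T via LAPACK's dpotrf
n_c = 500
X = np.random.randn(n_c, n_c)
S = X @ X.T + n_c * np.eye(n_c)          # symmetric positive definite
c, info = lapack.dpotrf(S, lower=1)      # info == 0 indicates success
L_chol = np.tril(c)                      # keep only the computed (lower) triangle
print(info, np.linalg.norm(L_chol @ L_chol.T - S))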

Example: Obtain an orthonormal basis for the columns of a matrix

In a variety of situations it is convenient to be able to obtain an orthonormal basis for the columns of a matrix A (e.g. subspace iteration, randomized numerical linear algebra). One way to do this is to use the QR decomposition:

m = 1000
n = 100
A = np.array(np.random.randn(m, n), order='F')

def get_Q_qr(A):
    Q, R = la.qr(A, mode='economic')
    return Q

Q = get_Q_qr(A)

print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
(1000, 100)
3.8251165998905494e-15

We can get this using LAPACK routines as well:

def get_Q_lapack(A):
    qr, tau, work, info = lapack.dgeqrf(A)
    Q, work, info = lapack.dorgqr(qr, tau)
    return Q

A = np.array(np.random.randn(m, n), order='F')
Q = get_Q_lapack(A)

print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
(1000, 100)
4.00715689225618e-15

You can also run many LAPACK (and BLAS) routines in-place, meaning you can overwrite the input matrix. This saves the time spent on memory allocation (note that the input must already be Fortran-ordered with the matching dtype; otherwise a copy is made anyway).

def get_Q_inplace(A):
    m, n = A.shape
    lwork = max(3*n, 1)
    qr, tau, work, info = lapack.dgeqrf(A, lwork, 1) # overwrite A = True
    Q, work, info = lapack.dorgqr(qr, tau, lwork, 1) # overwrite qr = True
    return Q

A = np.array(np.random.randn(m, n), order='F')
Q = get_Q_inplace(A)

print(Q.shape)
print(np.linalg.norm(Q.T @ Q - np.eye(n))) # measure how close Q is to orthogonal
print(np.linalg.norm(A - Q)) # A now contains Q
(1000, 100)
3.932829855433616e-15
0.0
m = 4096
n = 128

for get_Q in (get_Q_qr, get_Q_lapack, get_Q_inplace):
    
    print("timing {}".format(get_Q))
    A = np.array(np.random.randn(m, n), order='F')
    t0 = time.time()
    Q = get_Q(A)
    t1 = time.time()
    print("  time elapsed = {} sec.".format(t1 - t0))  
timing <function get_Q_qr at 0x7f6f27bcdf70>
  time elapsed = 0.019899606704711914 sec.
timing <function get_Q_lapack at 0x7f6efd9e1dc0>
  time elapsed = 0.016045093536376953 sec.
timing <function get_Q_inplace at 0x7f6efd9e1e50>
  time elapsed = 0.0139617919921875 sec.

As we see, the in-place version is fastest.

Exercise

One way to get the top k-dimensional eigenspace (the eigenspace associated with the k largest eigenvalues) of a symmetric matrix is to use a subspace version of the power method. Here is some pseudocode:


# get top k-dimensional eigenspace of symmetric matrix A
m, n = A.shape # m should equal n
Q = random n x k matrix
Q, R = qr(Q) # orthogonalize
for some number of iterations:
    Q = A @ Q
    Q, R = qr(Q) # re-orthogonalize
    
return Q

Implement a function that realizes this algorithm. Use BLAS to implement matrix-matrix multiplication, and use LAPACK to do the orthogonalization of Q.

Compare this to the span of the top-k eigenvectors of A obtained via eigh.

## Your Code here

Exercise

Implement a version of solve for a square matrix A which uses LAPACK for the LU decomposition, and BLAS for the triangular solves. You may want to look at dgetrf (the returned lu matrix stores L in its strictly lower triangular part, with L's unit diagonal not stored, and U in its upper triangular part).

## Your code here