Memory and Performance

This section will cover memory layout in numpy arrays.

%pylab inline
Populating the interactive namespace from numpy and matplotlib

Let’s start with an example: summing a row or a column of an array. The computational cost is exactly the same in either case: \(n\) additions.

n = 2000
x = np.random.rand(n, n)
%timeit x[0].sum()
3.8 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x[:,0].sum()
5.9 µs ± 930 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note that there is a noticeable difference in how long each loop takes to run: summing over a row is faster than summing over a column.

x = np.array(x, order='F')
%timeit x[0].sum()
5.41 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x[:,0].sum()
3.42 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Now summing over a column is faster.

Why did this order flag matter?

Physical Memory

There are several places you may have data stored on your computer:

  1. Hard Drive/SSD O(TB)

  2. Random Access Memory O(10GB)

  3. L3 Cache O(10MB)

  4. L2 Cache O(100KB)

  5. L1 Cache O(64KB)

  6. CPU (registers)

At the top of the list, we have a lot of storage, but it takes longer to access data (in terms of clock cycles). As we go down the list, the amount of available storage decreases, but we are able to access data much faster. We can access CPU registers in 1 clock cycle, but they can only hold a small number (e.g. 8) of 64-bit floats.

When you loop over an array, data is copied from one level of this hierarchy to the next, moving down the list. Usually you start in RAM, where you generate data or load it from disk.
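
To get a rough feel for this hierarchy, one (very hardware-dependent) experiment is to sum arrays of increasing size and watch the effective bandwidth drop once the data no longer fits in cache. The sketch below is illustrative only; the sizes at which you see drops, and the numbers you get, depend entirely on your machine.

import numpy as np   # already available via %pylab; included for completeness
import time

for n in [10**4, 10**5, 10**6, 10**7, 10**8]:
    x = np.random.rand(n)        # n 64-bit floats = 8*n bytes
    x.sum()                      # warm-up: pulls the data into cache if it fits
    reps = max(1, 10**8 // n)    # keep the total amount of work roughly constant
    t0 = time.perf_counter()
    for _ in range(reps):
        x.sum()
    t1 = time.perf_counter()
    gbytes = 8 * n * reps / 1e9  # total bytes read, in GB
    print(f"{8*n/1e6:8.1f} MB array: {gbytes / (t1 - t0):6.1f} GB/s effective bandwidth")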

Every element in an array has a unique address in RAM. On a 64-bit architecture it is convenient to think of memory in 64-bit (8-byte) words, each of which holds exactly one 64-bit double.

Typically, arrays are stored in contiguous blocks of memory, meaning that the first element is stored at some address a, the second element is stored at address a+1 (one word, i.e. 64 bits, later), and in general the ith element is stored at address a+i.

An array object contains a pointer to the start of its data (i.e. the address of the first element), so when you access an element x[i], you first look up the starting address a and then read the data at address a+i.

x = np.arange(5)
print(x)
a = x.data # address of start of array
print(a)
[0 1 2 3 4]
<memory at 0x7fc0e11a6580>

Note 0x40 = 64: measuring offsets in bits, consecutive 64-bit elements are 0x40 apart.

address  data
0x00     0
0x40     1
0x80     2
0xc0     3
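
We can check this layout directly in NumPy. The snippet below is a quick sanity check (the actual address will differ on your machine): __array_interface__ reports the starting address of the buffer in bytes, and strides reports the spacing between consecutive elements - 8 bytes, i.e. the 64 bits (0x40) per element shown in the table above.

x = np.arange(5)
addr = x.__array_interface__['data'][0]   # starting address of the data buffer, in bytes
print(hex(addr))
print(x.itemsize)   # bytes per element (8 for a 64-bit integer on most platforms)
print(x.strides)    # (8,): consecutive elements are 8 bytes = 64 bits apart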

When you access some element of memory for a computation in the CPU, you don’t just move the contents of that one address into cache; a whole block of memory around that address is moved with it. That way, when you look at the next address, it is already loaded in cache and you can access its data much faster.

When an address you want to access is not loaded in cache, it is called a cache miss, and it takes extra time to move that memory into cache and then to the CPU. Code that minimizes the number of cache misses will be much faster than code that incurs many of them.

This is why looping over an array in memory order is much faster than looping in a different order.

Two-dimensional arrays

Everything we’ve said so far is fairly straightforward for 1-dimensional arrays. What about multi-dimensional arrays?

We’ll consider 2-dimensional arrays, but these ideas generalize to higher dimensions.

Two-dimensional arrays have two indices, so at most one of them can be used to access adjacent addresses in memory. If we store rows (1st index) of a 2-dimensional array in contiguous memory, we say the array is in row major format. If we store columns (2nd index) of a 2-dimensional array in contiguous memory, we say the array is in column major format.
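
As a small illustration (just to show the two element orderings, not a statement about any particular library), flattening the same 2 x 3 array in 'C' and 'F' order shows which elements sit next to each other in memory under each convention.

a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.flatten(order='C'))   # [1 2 3 4 5 6] - row major: rows are contiguous
print(a.flatten(order='F'))   # [1 4 2 5 3 6] - column major: columns are contiguous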

C/C++ (and NumPy, by default) store multi-dimensional arrays in row major format

Fortran, Matlab, R, and Julia store multi-dimensional arrays in column major format

Because numerical libraries are often written in C/C++ or Fortran, you will never entirely escape worrying about row vs. column major formats in scientific computing. However, column major is often treated as the default in numerical linear algebra (in particular, the reference BLAS and LAPACK interfaces are column-major).

NumPy supports both row major and column major formats, but is row major by default. You can see this by accessing the flags field of an array.

x = np.random.rand(3,3)
x.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

C_CONTIGUOUS (the C default) refers to row-major format; F_CONTIGUOUS (the Fortran default) refers to column-major format.

The order flag in array creation can be used to control this.

print("Row major:")
x = np.array(np.random.rand(3,3), order='C')
print(x.flags)

print("Column major:")
x = np.array(np.random.rand(3,3), order='F')
print(x.flags)
Row major:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Column major:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Exercises

  1. Are 1-dimensional arrays row or column major? Look at the flags field of a numpy array.

  2. Go back to our example of computing the sum of rows and columns. Can you explain the timing differences now?

# Your code here
x = np.random.randn(5)
x.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Numba Examples

Memory storage has implications for how you may wish to loop over arrays in general. We’ll use Numba to demonstrate this, because naive Python loops have too much overhead to see a big difference.

from numba import jit

@jit(nopython=True) # throws error if not able to compile
def numba_matvec_row(A, x):
    """
    naive matrix-vector multiplication implementation
    
    Loops over rows in outer loop
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    
    return y


@jit(nopython=True) # throws error if not able to compile
def numba_matvec_col(A, x):
    """
    naive matrix-vector multiplication implementation
    
    Loops over columns in outer loop
    """
    m, n = A.shape
    y = np.zeros(m)
    for j in range(n):
        for i in range(m):
            y[i] = y[i] + A[i,j] * x[j]
    
    return y
m = 4000
n = 4000
A = np.array(np.random.randn(m,n), order='C')
x = np.random.randn(n)

# precompile
y = numba_matvec_row(A, x)
y = numba_matvec_col(A, x)
%timeit y = numba_matvec_row(A, x)
25.1 ms ± 982 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = numba_matvec_col(A, x)
47 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
m = 4000
n = 4000
A = np.array(np.random.randn(m,n), order='F')
x = np.random.randn(n)

# precompile
y = numba_matvec_row(A, x)
y = numba_matvec_col(A, x)
%timeit y = numba_matvec_row(A, x)
45.8 ms ± 4.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = numba_matvec_col(A, x)
11.3 ms ± 889 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Interestingly, the gap between the two loop orders is bigger when A is in column major order. Let’s see if we can improve the row major version. We’ll use NumPy’s dot function on each row to take explicit advantage of the contiguous row layout. (Note that Numba supports many built-in NumPy functions in nopython mode.)

@jit(nopython=True) # throws error if not able to compile
def numba_matvec_row2(A, x):
    """
    matrix-vector multiplication using np.dot on each (contiguous) row
    
    Loops over rows in outer loop
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        y[i] = np.dot(A[i], x) # takes explicit advantage of the contiguous row layout
    
    return y
m = 4000
n = 4000
A = np.array(np.random.randn(m,n), order='C')
x = np.random.randn(n)

# precompile
y = numba_matvec_row(A, x)
y = numba_matvec_row2(A, x)
%timeit y = numba_matvec_row(A, x)
22.3 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = numba_matvec_row2(A, x)
9.99 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now we finally see a speedup. Let’s compare to NumPy.

%timeit y = np.matmul(A, x)
9.7 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numba still isn’t quite as fast. Let’s see if we can use some parallelism to help us out.

from numba import prange

@jit(nopython=True, parallel=True) # throws error if not able to compile
def numba_matvec_row3(A, x):
    """
    parallel matrix-vector multiplication using np.dot on each (contiguous) row
    
    Loops over rows in outer loop (in parallel via prange)
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in prange(m):
        y[i] = np.dot(A[i], x) # takes explicit advantage of the contiguous row layout
    
    return y
m = 4000
n = 4000
A = np.array(np.random.randn(m,n), order='C')
x = np.random.randn(n)

# precompile
y = numba_matvec_row3(A, x)
%timeit y = numba_matvec_row3(A, x)
9.57 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit y = np.matmul(A, x)
9.17 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can conclude that looping over the matrix A in different orders can make a noticeable difference. However, even using Numba, it is hard to beat NumPy’s built-in capabilities.