Distances

A common task when dealing with data is computing the distance between two points.

We can use scipy.spatial.distance to compute a variety of distances.

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt
import scipy.spatial.distance as distance

A data set is a collection of observations, each of which may have several features. We’ll consider the situation where the data set is a matrix X, where each row X[i] is an observation. We’ll use n to denote the number of observations and p to denote the number of features, so X is a \(n \times p\) matrix.

For example, we might sample from a circle (with some gaussian noise)

def sample_circle(n, r=1, sigma=0.1):
    """
    sample n points from a circle of radius r
    add Gaussian noise with variance sigma^2
    """
    X = np.random.randn(n,2)
    X = r * X / np.linalg.norm(X,axis=1).reshape(-1,1)
    return X + sigma * np.random.randn(n,2)

X = sample_circle(100)
plt.scatter(X[:,0], X[:,1])
plt.show()
../_images/distances_3_0.png

We might also call a data set a “point cloud”, or a collection of points in some space.

Similarity, Dissimilarity, and Metric

A similarity between two points \(x, y\in X\) is a function \(s: X \times X \to \mathbb{R}\), where \(s\) is larger if \(x,y\) are more similar.

Cosine similarity is an an example of similarity for points in a real vector space. We can define it as \begin{equation} c(x,y) = \frac{x^T y}{|x|_2 |y|_2} \end{equation}

Note \(c(x,x) = 1\), and if \(x,y\) are orthogonal, then \(c(x,y) = 0\).

A dissimilarity between two points \(x,y \in X\) is a function \(d: X \times X \to \mathbb{R}_+\), where \(d\) is smaller if \(x,y\) are more similar. Sometimes, disimilarity functions will be called distances.

Cosine distance is an example of a dissimilarity for points in a real vector space. It is defined as \begin{equation} d(x,y) = 1 - c(x,y) \end{equation} Note \(d(x,x) = 0\), and \(d(x,y) = 1\) if \(x,y\) are orthogonal.

A metric is a disimilarity \(d\) that satisfies the metric axioms

  1. \(d(x,y) = 0\) iff \(x=y\) (identity)

  2. \(d(x,y) = d(y,x)\) (symmetry)

  3. \(d(x,y) \le d(x,z) + d(z,y)\) (triangle inequality)

You are probably most familiar with the Euclidean metric \begin{equation} d(x,y) = |x - y|_2 \end{equation}

Exercise

Write functions for the cosine similarity, cosine distance, and euclidean distance between two numpy arrays treated as vectors.

## Your code here
import numpy as np

cos_sim = lambda x, y : np.dot(x,y) / np.sqrt(np.dot(x,x) * np.dot(y,y))
cos_dis = lambda x, y : 1 - cos_sim(x,y)
euc_dis = lambda x, y : np.linalg.norm(x - y, 2)

x = np.array([1,2,3])
y = np.array([-1,2,3])
print(cos_sim(x,x), cos_dis(x,x), euc_dis(x,x))
print(cos_sim(x,-x), cos_dis(x,-x), euc_dis(x,-x))
print(cos_sim(x,y), cos_dis(x,y), euc_dis(x,y))
1.0 0.0 0.0
-1.0 2.0 7.483314773547883
0.8571428571428571 0.1428571428571429 2.0
X = np.random.randn(100, 3)
X[0].T @ X[0]
2.8389823765022792
X @ X.T
array([[ 2.83898238,  1.16304231, -2.32948147, ..., -3.38312712,
         1.07988589, -0.24763674],
       [ 1.16304231,  8.2530741 , -1.27677546, ...,  0.17204936,
         3.0345908 , -2.39239586],
       [-2.32948147, -1.27677546,  2.23607767, ...,  3.43250478,
        -0.99733681,  0.47330705],
       ...,
       [-3.38312712,  0.17204936,  3.43250478, ...,  6.01431271,
        -0.77625925,  0.24180186],
       [ 1.07988589,  3.0345908 , -0.99733681, ..., -0.77625925,
         1.27487243, -0.85996136],
       [-0.24763674, -2.39239586,  0.47330705, ...,  0.24180186,
        -0.85996136,  0.795015  ]])
x = np.random.randn(1,3)
y = np.random.randn(1,3)

Solution:

# The function for the cosine similarity:
def cosimi(x,y):
    num = x.T @ y
    denum = la.norm(x) * la.norm(y)
    return num/denum

# The functions for the cosine distance:
def cosdis(x,y):
    num = x.T @ y
    denum = la.norm(x) * la.norm(y)
    cosimi = num/denum
    return 1-cosimi

# The function for the euclidean distance:
def euclidis(x,y):
    return la.norm(x-y)

Computing Distances

SciPy provides a variety of functionality for computing distances in scipy.spatial.distance.

x = [1.0, 0.0]
y = [0.0, 1.0]

distance.euclidean(x, y) # sqrt(2)
1.4142135623730951
distance.cosine(x, y)
1.0

the \(L_1\) / Manhattan / City block distance \begin{equation} d(x,y) = |x - y|_1 = \sum_i |x_i - y_i| \end{equation}

distance.cityblock(x, y)
2.0

The \(L_\infty\) / Chebyshev distance \begin{equation} d(x,y) = |x - y|_\infty = \max_i |x_i - y_i| \end{equation}

distance.chebyshev(x, y)
1.0

More generally, the Minkowski distance \begin{equation} d(x,y) = |x - y|_p = \big( \sum_i (x_i - y_i)^p \big)^{1/p} \end{equation}

distance.minkowski(x, y, 3)
1.2599210498948732
np.cbrt(2) # cube root of 2
1.2599210498948734

The Mahalanobis distance. Note that this is defined in terms of an inverse covariance matrix. In fact, you can use it to compute distances based on the \(A\)-norm where \(A\) is any symmetric positive definite (SPD) matrix.

\begin{equation} d(x,y) = | x- y |_A = \sqrt{(x - y)^T A (x - y)} \end{equation}

A = np.random.randn(3,2)
A = A.T @ A
distance.mahalanobis(x, y, A)
1.5730650597233107

You can also weight many distances with a vector \(w\). The entry \(w_i\) will weight the comparison between the \(i\)th entries of \(x\) and \(y\). E.g. weighted euclidean distance \begin{equation} d(x, y; w) = \sqrt{\sum_i w_i(x_i - y_i)^2} \end{equation}

w = [1, 2]
for d in (distance.euclidean, distance.cityblock, lambda x, y, w : distance.minkowski(x, y, 3, w)):
    print(d(x, y, w))
1.7320508075688774
3.0
1.4422495703074083

Exercise

Compute a weighted euclidean distance using the Mahalanobis distance.

## Your code here

Pairwise Distances

In general, we often want to compute distances between many points all at once. This typically boils down to computing a matrix \(D\), where \(D_{i,j} = d(x_i, x_j)\).

Here’s a simple function that would do just that

def pairwise(X, dist=distance.euclidean):
    """
    compute pairwise distances in n x p numpy array X
    """
    n, p = X.shape
    D = np.empty((n,n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            D[i,j] = dist(X[i], X[j])
            
    return D
X = sample_circle(5)
pairwise(X)
array([[0.        , 1.93355563, 1.69804168, 1.84887352, 1.86117462],
       [1.93355563, 0.        , 1.30364128, 0.53174555, 0.65387795],
       [1.69804168, 1.30364128, 0.        , 0.79595082, 0.68934765],
       [1.84887352, 0.53174555, 0.79595082, 0.        , 0.12250706],
       [1.86117462, 0.65387795, 0.68934765, 0.12250706, 0.        ]])
pairwise(X, dist=lambda x, y : distance.minkowski(x, y, 3, [1.0, 2.0]))
array([[0.        , 1.90448822, 1.69516033, 1.78652978, 1.80889064],
       [1.90448822, 0.        , 1.62901861, 0.63161768, 0.77284517],
       [1.69516033, 1.62901861, 0.        , 1.00254106, 0.86383667],
       [1.78652978, 0.63161768, 1.00254106, 0.        , 0.14146722],
       [1.80889064, 0.77284517, 0.86383667, 0.14146722, 0.        ]])

There are a couple of inneficiencies above:

  1. D is symmetric (we do redundant computation)

  2. For some metrics, it can be more efficient to compute many distances at once

As an example of point 2, pairwise euclidean distances are \begin{equation} d(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i -x_j)}\ = \sqrt{x_i^T x_i - 2 x_i^T x_j + x_j^T x_j} \end{equation}

Let \(x^2\) denote the vector of length \(n\) where \(x_i^2 = x_i^T x_i\), and let \(1\) denote the vector of ones (of length \(n\)). Then \begin{equation} D = \sqrt{1 x^{2T} + x^2 1^T - 2 X X^T} \end{equation}

Thus, we can take advantage of BLAS level 3 operations to compute the distance matrix

def pairwise_euclidean(X):
    """
    Compute pairwise euclidean distances in X
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = np.outer(x2, one) + np.outer(one, x2) - 2*XXT
    return np.sqrt(D)

la.norm(pairwise_euclidean(X) - pairwise(X))
4.304351991299644e-16

We can also use BLAS syr2 for the rank-2 update:

print(la.blas.dsyr2.__doc__)
a = dsyr2(alpha,x,y,[lower,incx,offx,incy,offy,n,a,overwrite_a])

Wrapper for ``dsyr2``.

Parameters
----------
alpha : input float
x : input rank-1 array('d') with bounds (*)
y : input rank-1 array('d') with bounds (*)

Other Parameters
----------------
lower : input int, optional
    Default: 0
incx : input int, optional
    Default: 1
offx : input int, optional
    Default: 0
incy : input int, optional
    Default: 1
offy : input int, optional
    Default: 0
n : input int, optional
    Default: ((len(x)-1-offx)/abs(incx)+1 <=(len(y)-1-offy)/abs(incy)+1 ?(len(x)-1-offx)/abs(incx)+1 :(len(y)-1-offy)/abs(incy)+1)
a : input rank-2 array('d') with bounds (n,n)
overwrite_a : input int, optional
    Default: 0

Returns
-------
a : rank-2 array('d') with bounds (n,n)
def pairwise_euclidean_blas(X):
    """
    Compute pairwise euclidean distances in X 
    use syrk2 for rank-2 update
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = la.blas.dsyr2(1.0, x2, one, a=-2*XXT) # this only updates upper triangular part
    D = np.triu(D) # extract upper triangular part
    return np.sqrt(D + D.T)

la.norm(pairwise_euclidean_blas(X) - pairwise(X))
3.510833468576701e-16
import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))
pairwise : 0.12298965454101562 sec.
pairwise_euclidean : 0.0006000995635986328 sec.
pairwise_euclidean_blas : 0.0002651214599609375 sec.

Exercise

Write a version of pairwise_euclidean which uses Numba. How does this compare to the BLAS implementation?

## Your code here

pdist

You don’t need to implement these faster methods yourself. scipy.spatial.distance.pdist has built-in optimizations for a variety of pairwise distance computations. You can use scipy.spatial.distance.cdist if you are computing pairwise distances between two data sets \(X, Y\).

from scipy.spatial.distance import pdist, cdist
D = pdist(X)

The output of pdist is not a matrix, but a condensed form which stores the lower-triangular entries in a vector.

D.shape
(4950,)

to get a square matrix, you can use squareform. You can also use squareform to go back to the condensed form.

from scipy.spatial.distance import squareform

D = squareform(D)
D.shape
(100, 100)
squareform(D).shape
(4950,)

pdist takes in a keyword argument metric, which determines the distance computed

pdist(X, metric='cityblock')
array([2.56905948, 1.03224862, 3.14994077, ..., 2.14184371, 0.92547433,
       3.06648737])

Let’s compare to our implementations

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))
pairwise : 0.10127639770507812 sec.
pairwise_euclidean : 0.00022554397583007812 sec.
pairwise_euclidean_blas : 0.0003819465637207031 sec.
pdist : 0.0001583099365234375 sec.

If we form the squareform:

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

def sq_pdist(X, **kwargs):
    return squareform(pdist(X, **kwargs))

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, sq_pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))
pairwise : 0.1405320167541504 sec.
pairwise_euclidean : 0.0004215240478515625 sec.
pairwise_euclidean_blas : 0.0002703666687011719 sec.
sq_pdist : 0.0002224445343017578 sec.
D = cdist(X[:30],X[30:])
D.shape
(30, 70)

Exercise

Use pdist to compute ’braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, and ‘euclidean’, distances on a random X. Are there noticeable differences in the time it takes to compute each distance?

## Your code here