Distances

Distances¶

A common task when dealing with data is computing the distance between two points.

We can use scipy.spatial.distance to compute a variety of distances.

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt
import scipy.spatial.distance as distance

A data set is a collection of observations, each of which may have several features. We’ll consider the situation where the data set is a matrix X, where each row X[i] is an observation. We’ll use n to denote the number of observations and p to denote the number of features, so X is a \(n \times p\) matrix.

For example, we might sample from a circle (with some gaussian noise)

def sample_circle(n, r=1, sigma=0.1):
    """
    sample n points from a circle of radius r
    add Gaussian noise with variance sigma^2
    """
    X = np.random.randn(n,2)
    X = r * X / np.linalg.norm(X,axis=1).reshape(-1,1)
    return X + sigma * np.random.randn(n,2)

X = sample_circle(100)
plt.scatter(X[:,0], X[:,1])
plt.show()

We might also call a data set a “point cloud”, or a collection of points in some space.

Similarity, Dissimilarity, and Metric¶

A similarity between two points \(x, y\in X\) is a function \(s: X \times X \to \mathbb{R}\), where \(s\) is larger if \(x,y\) are more similar.

Cosine similarity is an an example of similarity for points in a real vector space. We can define it as \begin{equation} c(x,y) = \frac{x^T y}{|x|_2 |y|_2} \end{equation}

Note \(c(x,x) = 1\), and if \(x,y\) are orthogonal, then \(c(x,y) = 0\).

A dissimilarity between two points \(x,y \in X\) is a function \(d: X \times X \to \mathbb{R}_+\), where \(d\) is smaller if \(x,y\) are more similar. Sometimes, disimilarity functions will be called distances.

Cosine distance is an example of a dissimilarity for points in a real vector space. It is defined as \begin{equation} d(x,y) = 1 - c(x,y) \end{equation} Note \(d(x,x) = 0\), and \(d(x,y) = 1\) if \(x,y\) are orthogonal.

A metric is a disimilarity \(d\) that satisfies the metric axioms

\(d(x,y) = 0\) iff \(x=y\) (identity)
\(d(x,y) = d(y,x)\) (symmetry)
\(d(x,y) \le d(x,z) + d(z,y)\) (triangle inequality)

You are probably most familiar with the Euclidean metric \begin{equation} d(x,y) = |x - y|_2 \end{equation}

Exercise¶

Write functions for the cosine similarity, cosine distance, and euclidean distance between two numpy arrays treated as vectors.

## Your code here

import numpy as np

cos_sim = lambda x, y : np.dot(x,y) / np.sqrt(np.dot(x,x) * np.dot(y,y))
cos_dis = lambda x, y : 1 - cos_sim(x,y)
euc_dis = lambda x, y : np.linalg.norm(x - y, 2)

x = np.array([1,2,3])
y = np.array([-1,2,3])
print(cos_sim(x,x), cos_dis(x,x), euc_dis(x,x))
print(cos_sim(x,-x), cos_dis(x,-x), euc_dis(x,-x))
print(cos_sim(x,y), cos_dis(x,y), euc_dis(x,y))

1.0 0.0 0.0
-1.0 2.0 7.483314773547883
0.8571428571428571 0.1428571428571429 2.0

X = np.random.randn(100, 3)
X[0].T @ X[0]

2.8389823765022792

X @ X.T

array([[ 2.83898238,  1.16304231, -2.32948147, ..., -3.38312712,
         1.07988589, -0.24763674],
       [ 1.16304231,  8.2530741 , -1.27677546, ...,  0.17204936,
         3.0345908 , -2.39239586],
       [-2.32948147, -1.27677546,  2.23607767, ...,  3.43250478,
        -0.99733681,  0.47330705],
       ...,
       [-3.38312712,  0.17204936,  3.43250478, ...,  6.01431271,
        -0.77625925,  0.24180186],
       [ 1.07988589,  3.0345908 , -0.99733681, ..., -0.77625925,
         1.27487243, -0.85996136],
       [-0.24763674, -2.39239586,  0.47330705, ...,  0.24180186,
        -0.85996136,  0.795015  ]])

x = np.random.randn(1,3)
y = np.random.randn(1,3)

Solution:¶

# The function for the cosine similarity:
def cosimi(x,y):
    num = x.T @ y
    denum = la.norm(x) * la.norm(y)
    return num/denum

# The functions for the cosine distance:
def cosdis(x,y):
    num = x.T @ y
    denum = la.norm(x) * la.norm(y)
    cosimi = num/denum
    return 1-cosimi

# The function for the euclidean distance:
def euclidis(x,y):
    return la.norm(x-y)

Computing Distances¶

SciPy provides a variety of functionality for computing distances in scipy.spatial.distance.

x = [1.0, 0.0]
y = [0.0, 1.0]

distance.euclidean(x, y) # sqrt(2)

1.4142135623730951

distance.cosine(x, y)

1.0

the \(L_1\) / Manhattan / City block distance \begin{equation} d(x,y) = |x - y|_1 = \sum_i |x_i - y_i| \end{equation}

distance.cityblock(x, y)

2.0

The \(L_\infty\) / Chebyshev distance \begin{equation} d(x,y) = |x - y|_\infty = \max_i |x_i - y_i| \end{equation}

distance.chebyshev(x, y)

1.0

More generally, the Minkowski distance \begin{equation} d(x,y) = |x - y|_p = \big( \sum_i (x_i - y_i)^p \big)^{1/p} \end{equation}

distance.minkowski(x, y, 3)

1.2599210498948732

np.cbrt(2) # cube root of 2

1.2599210498948734

The Mahalanobis distance. Note that this is defined in terms of an inverse covariance matrix. In fact, you can use it to compute distances based on the \(A\)-norm where \(A\) is any symmetric positive definite (SPD) matrix.

\begin{equation} d(x,y) = | x- y |_A = \sqrt{(x - y)^T A (x - y)} \end{equation}

A = np.random.randn(3,2)
A = A.T @ A
distance.mahalanobis(x, y, A)

1.5730650597233107

You can also weight many distances with a vector \(w\). The entry \(w_i\) will weight the comparison between the \(i\)th entries of \(x\) and \(y\). E.g. weighted euclidean distance \begin{equation} d(x, y; w) = \sqrt{\sum_i w_i(x_i - y_i)^2} \end{equation}

w = [1, 2]
for d in (distance.euclidean, distance.cityblock, lambda x, y, w : distance.minkowski(x, y, 3, w)):
    print(d(x, y, w))

7320508075688774
0
4422495703074083

Exercise¶

Compute a weighted euclidean distance using the Mahalanobis distance.

## Your code here

Pairwise Distances¶

In general, we often want to compute distances between many points all at once. This typically boils down to computing a matrix \(D\), where \(D_{i,j} = d(x_i, x_j)\).

Here’s a simple function that would do just that

def pairwise(X, dist=distance.euclidean):
    """
    compute pairwise distances in n x p numpy array X
    """
    n, p = X.shape
    D = np.empty((n,n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            D[i,j] = dist(X[i], X[j])
            
    return D

X = sample_circle(5)
pairwise(X)

array([[0.        , 1.93355563, 1.69804168, 1.84887352, 1.86117462],
       [1.93355563, 0.        , 1.30364128, 0.53174555, 0.65387795],
       [1.69804168, 1.30364128, 0.        , 0.79595082, 0.68934765],
       [1.84887352, 0.53174555, 0.79595082, 0.        , 0.12250706],
       [1.86117462, 0.65387795, 0.68934765, 0.12250706, 0.        ]])

pairwise(X, dist=lambda x, y : distance.minkowski(x, y, 3, [1.0, 2.0]))

array([[0.        , 1.90448822, 1.69516033, 1.78652978, 1.80889064],
       [1.90448822, 0.        , 1.62901861, 0.63161768, 0.77284517],
       [1.69516033, 1.62901861, 0.        , 1.00254106, 0.86383667],
       [1.78652978, 0.63161768, 1.00254106, 0.        , 0.14146722],
       [1.80889064, 0.77284517, 0.86383667, 0.14146722, 0.        ]])

There are a couple of inneficiencies above:

D is symmetric (we do redundant computation)
For some metrics, it can be more efficient to compute many distances at once

As an example of point 2, pairwise euclidean distances are \begin{equation} d(x_i, x_j) = \sqrt{(x_i - x_j)^T (x_i -x_j)}\ = \sqrt{x_i^T x_i - 2 x_i^T x_j + x_j^T x_j} \end{equation}

Let \(x^2\) denote the vector of length \(n\) where \(x_i^2 = x_i^T x_i\), and let \(1\) denote the vector of ones (of length \(n\)). Then \begin{equation} D = \sqrt{1 x^{2T} + x^2 1^T - 2 X X^T} \end{equation}

Thus, we can take advantage of BLAS level 3 operations to compute the distance matrix

def pairwise_euclidean(X):
    """
    Compute pairwise euclidean distances in X
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = np.outer(x2, one) + np.outer(one, x2) - 2*XXT
    return np.sqrt(D)

la.norm(pairwise_euclidean(X) - pairwise(X))

4.304351991299644e-16

We can also use BLAS syr2 for the rank-2 update:

print(la.blas.dsyr2.__doc__)

a = dsyr2(alpha,x,y,[lower,incx,offx,incy,offy,n,a,overwrite_a])

Wrapper for ``dsyr2``.

Parameters
----------
alpha : input float
x : input rank-1 array('d') with bounds (*)
y : input rank-1 array('d') with bounds (*)

Other Parameters
----------------
lower : input int, optional
    Default: 0
incx : input int, optional
    Default: 1
offx : input int, optional
    Default: 0
incy : input int, optional
    Default: 1
offy : input int, optional
    Default: 0
n : input int, optional
    Default: ((len(x)-1-offx)/abs(incx)+1 <=(len(y)-1-offy)/abs(incy)+1 ?(len(x)-1-offx)/abs(incx)+1 :(len(y)-1-offy)/abs(incy)+1)
a : input rank-2 array('d') with bounds (n,n)
overwrite_a : input int, optional
    Default: 0

Returns
-------
a : rank-2 array('d') with bounds (n,n)

def pairwise_euclidean_blas(X):
    """
    Compute pairwise euclidean distances in X 
    use syrk2 for rank-2 update
    """
    XXT = X @ X.T
    x2 = np.diag(XXT)
    one = np.ones(X.shape[0])
    D = la.blas.dsyr2(1.0, x2, one, a=-2*XXT) # this only updates upper triangular part
    D = np.triu(D) # extract upper triangular part
    return np.sqrt(D + D.T)

la.norm(pairwise_euclidean_blas(X) - pairwise(X))

3.510833468576701e-16

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))

pairwise : 0.12298965454101562 sec.
pairwise_euclidean : 0.0006000995635986328 sec.
pairwise_euclidean_blas : 0.0002651214599609375 sec.

Exercise¶

Write a version of pairwise_euclidean which uses Numba. How does this compare to the BLAS implementation?

## Your code here

`pdist`¶

You don’t need to implement these faster methods yourself. scipy.spatial.distance.pdist has built-in optimizations for a variety of pairwise distance computations. You can use scipy.spatial.distance.cdist if you are computing pairwise distances between two data sets \(X, Y\).

from scipy.spatial.distance import pdist, cdist

D = pdist(X)

The output of pdist is not a matrix, but a condensed form which stores the lower-triangular entries in a vector.

D.shape

(4950,)

to get a square matrix, you can use squareform. You can also use squareform to go back to the condensed form.

from scipy.spatial.distance import squareform

D = squareform(D)
D.shape

(100, 100)

squareform(D).shape

(4950,)

pdist takes in a keyword argument metric, which determines the distance computed

pdist(X, metric='cityblock')

array([2.56905948, 1.03224862, 3.14994077, ..., 2.14184371, 0.92547433,
       3.06648737])

Let’s compare to our implementations

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))

pairwise : 0.10127639770507812 sec.
pairwise_euclidean : 0.00022554397583007812 sec.
pairwise_euclidean_blas : 0.0003819465637207031 sec.
pdist : 0.0001583099365234375 sec.

If we form the squareform:

import time

# generate data set
n = 100
d = 2

X = np.random.randn(n, d)

def sq_pdist(X, **kwargs):
    return squareform(pdist(X, **kwargs))

for d in (pairwise, pairwise_euclidean, pairwise_euclidean_blas, sq_pdist):
    t0 = time.time()
    D = d(X)
    t1 = time.time()
    print("{} : {} sec.".format(d.__name__, t1 - t0))

pairwise : 0.1405320167541504 sec.
pairwise_euclidean : 0.0004215240478515625 sec.
pairwise_euclidean_blas : 0.0002703666687011719 sec.
sq_pdist : 0.0002224445343017578 sec.

D = cdist(X[:30],X[30:])
D.shape

(30, 70)

Exercise¶

Use pdist to compute ’braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, and ‘euclidean’, distances on a random X. Are there noticeable differences in the time it takes to compute each distance?

## Your code here

Geometry

Nearest Neighbors

Scientific Computing with Python

Distances

Contents

Distances¶

Similarity, Dissimilarity, and Metric¶

Exercise¶

Solution:¶

Computing Distances¶

Exercise¶

Pairwise Distances¶

Exercise¶

`pdist`¶

Exercise¶

Scientific Computing with Python

Distances

Contents

Distances¶

Similarity, Dissimilarity, and Metric¶

Exercise¶

Solution:¶

Computing Distances¶

Exercise¶

Pairwise Distances¶

Exercise¶

pdist¶

Exercise¶

`pdist`¶