The Boon of Dimensionality

High-dimensional space is mostly empty. Machine learning lives in the gap.


I recently watched Grant Sanderson's (3blue1brown) video about the volume of high-dimensional spheres. He noted that high-dimensional space is also of peculiar interest to the ML field. I knew of the curse of dimensionality, but I wanted to dig deeper, and here is what I found: the other side, the boon of dimensionality.

I would recommend watching Grant's video first. I will try my best here to explain the intuition behind the presented results.

Volume of a high-dimensional ball#

The result that Grant's video centers on is that the volume of a ball with radius $r$ in $d$ dimensions is:

$$V_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d$$

where $\Gamma(x)$ is the Gamma function. Plotting $V_d$ for $d$ from 1 to 50, we get something like this.

Volume of a unit ball in d dimensions

You can see the volume of the unit ball peaks at around $d = 5$ (roughly 5.26) and by $d = 50$ becomes vanishingly small, since the Gamma function in the denominator grows much faster than the numerator. Even stranger is how this volume is distributed. Consider a ball of radius $(1 - \epsilon)$, where $\epsilon$ is a vanishingly small positive value. The ratio of its volume to the full ball's is

$$\frac{V(1-\epsilon)}{V(1)} = (1-\epsilon)^d \leq e^{-\epsilon d}$$

since $(1-x) \leq e^{-x}$, which follows from the Taylor expansion of $e^{-x}$. As $d \rightarrow \infty$, the ratio $\rightarrow 0$ for any fixed $\epsilon$. This tells us that most of the already small volume is concentrated in a thin shell near the surface. This is a special case of a general principle called concentration of measure: the tendency of high-dimensional probability distributions to concentrate their mass in thin regions.
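To see how fast the interior empties out, here is a minimal sketch (plain Python, no dependencies) comparing the exact inner-ball ratio $(1-\epsilon)^d$ against the $e^{-\epsilon d}$ bound:

```python
import math

def shell_fraction(eps, d):
    """Fraction of a unit d-ball's volume lying in the outer shell of thickness eps."""
    return 1 - (1 - eps) ** d

# With eps = 0.01, a shell of just 1% thickness holds almost everything once d is large.
for d in [2, 50, 500, 5000]:
    inner = (1 - 0.01) ** d        # volume ratio of the inner ball of radius 0.99
    bound = math.exp(-0.01 * d)    # the e^{-eps d} upper bound from the text
    assert inner <= bound          # (1 - x)^d <= e^{-x d}
    print(d, round(shell_fraction(0.01, d), 4))
```

Already at $d = 500$ the outer 1% shell holds more than 99% of the volume.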

Equators#

Yet another interesting result is about equators, but first: what is the equator of a high-dimensional ball? Pick a coordinate, say $x_1$; all points with $-1 \leq x_1 < 0$ lie in one hemisphere and those with $0 < x_1 \leq 1$ in the other, and the $x_1 = 0$ boundary is the equator. A slice at height $x_1$ has radius $\sqrt{1 - x_1^2}$ (Pythagoras), and the volume of such a slice is $\propto (1 - x_1^2)^{(d-1)/2}$, since we have fixed one dimension and are left with $d-1$. This is again bounded by

$$(1 - x_1^2)^{(d-1)/2} \leq e^{-(d-1) x_1^2 / 2}$$

This shows that the slice volume decreases exponentially as we move away from $x_1 = 0$, which means most of the volume is also concentrated near $x_1 = 0$. Combining this with the shell result, most points sit in a thin band around the equator, near the surface.
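A quick Monte Carlo check of this, as a sketch using only the standard library (sampling uniform points on the sphere by normalizing Gaussian vectors, a standard trick):

```python
import math
import random

random.seed(0)

def frac_near_equator(d, width=0.1, n=2000):
    """Fraction of uniform random points on the unit sphere in d dims
    whose first coordinate lies within `width` of the equator x1 = 0."""
    count = 0
    for _ in range(n):
        v = [random.gauss(0, 1) for _ in range(d)]  # isotropic Gaussian direction
        norm = math.sqrt(sum(x * x for x in v))     # normalize onto the sphere
        if abs(v[0] / norm) < width:
            count += 1
    return count / n

# The same narrow band around the equator captures more of the sphere as d grows.
for d in [3, 30, 300]:
    print(d, frac_near_equator(d))
```

In 3 dimensions the band $|x_1| < 0.1$ holds about 10% of the sphere; by $d = 300$ it holds the vast majority.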

Near Orthogonality#

The value $x_1$ is just $\langle x, e_1 \rangle$, where $e_1$ is the unit vector along the chosen axis. If most points lie near the equator (small $x_1$), they are nearly orthogonal to $e_1$. Since balls are rotationally symmetric, the same holds for any direction. So pick one point, make it the north pole, and we can be very confident the next point we pick will lie near the equator.

Pairwise dot products of random unit vectors
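The plot above can be reproduced with a small stdlib-only sketch: draw random unit vectors and look at the worst (largest-magnitude) pairwise dot product, which shrinks as the dimension grows.

```python
import math
import random

random.seed(1)

def random_unit_vector(d):
    v = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_dot(d, n=50):
    """Largest |cosine| among all pairs of n random unit vectors in d dims."""
    vecs = [random_unit_vector(d) for _ in range(n)]
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            worst = max(worst, abs(dot))
    return worst

# Even the worst of 1225 pairs drifts toward orthogonality as d grows.
for d in [5, 50, 500]:
    print(d, round(max_abs_dot(d), 3))
```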

The Johnson-Lindenstrauss Lemma#

The near-orthogonality result has a famous companion that makes the capacity claim precise. In 1984, Johnson and Lindenstrauss proved the following [1]:

Take any $n$ points in a high-dimensional space. You can project them into a space of just $O(\epsilon^{-2} \log n)$ dimensions and preserve all pairwise distances up to a small factor $(1 \pm \epsilon)$.

While this is originally stated as a compression result, we can read it backwards to see that a $d$-dimensional space has the capacity to represent $e^{\Omega(d)}$ points with geometry intact. Note that this is exponential in $d$.
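A minimal sketch of the lemma in action, using only the standard library: project 20 random points from 1000 dimensions down to 200 through a random Gaussian matrix (one common JL construction; the dimensions here are illustrative, not tuned) and check how much pairwise distances distort.

```python
import math
import random

random.seed(2)

def rand_point(d):
    return [random.gauss(0, 1) for _ in range(d)]

def project(p, R, k):
    """Multiply p by the k x d random Gaussian matrix R, scaled by 1/sqrt(k)."""
    return [sum(r * x for r, x in zip(row, p)) / math.sqrt(k) for row in R]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d, k, n = 1000, 200, 20
points = [rand_point(d) for _ in range(n)]
R = [rand_point(d) for _ in range(k)]          # random projection matrix
low = [project(p, R, k) for p in points]

# Distortion per pair: projected distance over original distance.
ratios = [dist(low[i], low[j]) / dist(points[i], points[j])
          for i in range(n) for j in range(i + 1, n)]
print(min(ratios), max(ratios))  # all ratios stay close to 1, i.e. (1 +/- eps)
```

All 190 pairwise distances survive a 5x dimension reduction to within a few percent.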

This is not just a theoretical curiosity. Embedding models like word2vec [2] or sentence transformers [3] pack millions of concepts into ~768 dimensions. The JL capacity result tells us why this works: $e^{\Omega(768)}$ is an absurdly large number of near-orthogonal directions. Unrelated words get mapped to nearly orthogonal vectors and don't interfere with each other. Analogy arithmetic (king - man + woman = queen) works because directions in this space are meaningful and there is room for them not to collide. Random projections exploit this directly too: you can project high-dimensional data down to $O(\log n)$ dimensions via a random matrix and preserve distances, which is the basis of locality-sensitive hashing and streaming sketches.

Where next#

The curse of dimensionality tells us that data gets sparse and distances lose meaning as dimensions grow. The boon is the flip side: exponential capacity, near-orthogonality for free, and distances that survive projection. Same geometry, two readings. ML lives in the gap since real data is high-dimensional but structured (it sits on a manifold), and the boon wins over the curse.

There are two directions I want to explore from here. First is Cover's theorem (1965), which says data that is not linearly separable in low dimensions becomes separable when mapped to higher dimensions. This is why kernel methods and neural network hidden layers work: they buy room. Second is superposition in neural networks, the idea that a network with $m$ neurons can represent far more than $m$ features by packing them into near-orthogonal directions, which connects directly to the geometry above. But those are for another post.

Footnotes#

  1. W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemporary Mathematics, 26, 1984.

  2. T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, 2013.

  3. N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv:1908.10084, 2019.