The Boon of Dimensionality
High-dimensional space is mostly empty. Machine learning lives in the gap.
I recently watched Grant Sanderson’s (3blue1brown) video about the volume of high-dimensional spheres. He noted that high-dimensional space is also of peculiar interest to the ML field. I knew of the curse of dimensionality, but I wanted to dig deeper, and here is what I found: the other side, the boon of dimensionality.
I would recommend watching Grant’s video first. I will do my best here to explain the intuition behind the results it presents.
## Volume of a high-dimensional ball
The result that Grant’s video centers on is that the volume of a ball with radius $R$ in $n$ dimensions is:

$$V_n(R) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} R^n$$

where $\Gamma$ is the Gamma function. Plotting $V_n(1)$ against $n$ we get something like this.
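As a sanity check, the formula is easy to evaluate numerically. A minimal sketch (the helper name `ball_volume` is mine; `math.gamma` is Python’s Gamma function):

```python
import math

def ball_volume(n, r=1.0):
    # V_n(r) = pi^(n/2) / Gamma(n/2 + 1) * r^n
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

vols = {n: ball_volume(n) for n in range(1, 21)}
peak = max(vols, key=vols.get)
print(peak, round(vols[peak], 2))  # prints: 5 5.26
```

For $n = 2$ and $n = 3$ this recovers the familiar $\pi r^2$ and $\frac{4}{3}\pi r^3$, and the unit-ball volume indeed peaks at $n = 5$ before collapsing toward zero.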
You can see the volume peaks at $n = 5$, where it is roughly $5.26$, and as $n \to \infty$ it becomes vanishingly small, since the Gamma function in the denominator grows much faster than the numerator. Even more strange is how this volume is distributed. Consider a ball of radius $(1 - \epsilon)R$, where $\epsilon$ is an infinitesimal (read: vanishingly small) value. The ratio of its volume to the full ball’s is

$$\frac{V_n((1-\epsilon)R)}{V_n(R)} = (1 - \epsilon)^n \approx e^{-n\epsilon},$$

since $\ln(1 - \epsilon) \approx -\epsilon$ by Taylor expansion. We can see that as $n \to \infty$, the ratio $\to 0$. This tells us that most of the already small volume is concentrated near its surface. This is a special case of a general principle called concentration of measure: the tendency for high-dimensional probability distributions to concentrate their mass in thin regions.
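To get a feel for the numbers, take $n = 1000$ and shave off just $1\%$ of the radius (a quick sketch; the variable names are my own):

```python
import math

n, eps = 1000, 0.01
ratio = (1 - eps) ** n       # exact volume ratio of the shrunken ball
approx = math.exp(-n * eps)  # the e^(-n*eps) approximation
# both come out around 4e-5: over 99.99% of the volume
# lives in the outer 1% shell
print(ratio, approx)
```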
## Equators
Yet another interesting thing is the equator, but first: what is the equator of a high-dimensional ball? We can pick a coordinate, say $x_1$; all the points with $x_1 > 0$ lie in one hemisphere and those with $x_1 < 0$ in the other, and the boundary $x_1 = 0$ is the equator. The slice at $x_1 = t$ is an $(n-1)$-dimensional ball of radius $\sqrt{R^2 - t^2}$ (Pythagoras), and its volume is proportional to $(R^2 - t^2)^{(n-1)/2}$, since we have fixed one dimension and are left with $n-1$ dimensions. For the unit ball this is bounded by

$$(1 - t^2)^{(n-1)/2} \le e^{-(n-1)t^2/2}$$

This shows that the slice volume decreases exponentially as we move away from $t = 0$, which means most of the volume is also concentrated near $t = 0$. Combining this with the shell result, most of the points are concentrated on the equator.
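We can watch this happen empirically by sampling uniform points on a high-dimensional unit sphere and looking at the first coordinate. Normalizing a Gaussian vector is the standard sampling trick; the helper name below is my own:

```python
import math
import random

rng = random.Random(0)

def random_unit_vector(n):
    # a normalized Gaussian vector is uniformly distributed on the unit sphere
    v = [rng.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

n = 1000
mean_abs_x1 = sum(abs(random_unit_vector(n)[0]) for _ in range(200)) / 200
print(mean_abs_x1)  # on the order of 1/sqrt(n) ~ 0.03: the points hug the equator
```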
## Near Orthogonality
The value $x_1$ is just $\langle x, e_1 \rangle$, where $e_1$ is the unit vector along the chosen axis. If most points lie near the equator (small $x_1$), they are nearly orthogonal to $e_1$. Since balls are rotationally symmetric, the same can be said about any direction. So pick one point, make it the north pole, and we can be very confident the next point we pick will lie near the equator, i.e. nearly orthogonal to the first.
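The same experiment, read pairwise: two independently sampled points on a high-dimensional sphere have a tiny cosine similarity. A sketch under the same Gaussian-sampling assumption as before:

```python
import math
import random

rng = random.Random(1)

def random_unit_vector(n):
    v = [rng.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

n = 2000
dots = []
for _ in range(100):
    u, v = random_unit_vector(n), random_unit_vector(n)
    dots.append(abs(sum(a * b for a, b in zip(u, v))))
# even the worst of 100 random pairs is nearly orthogonal:
# |cos| stays on the order of 1/sqrt(n)
print(max(dots))
```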
## The Johnson-Lindenstrauss Lemma
The near-orthogonality result has a famous companion that makes the capacity claim precise. In 1984, Johnson and Lindenstrauss proved the following [1]:

> Take any $N$ points in a high-dimensional space. You can project them into a space of just $O\left(\log N / \epsilon^2\right)$ dimensions and preserve all pairwise distances up to a small factor $1 \pm \epsilon$.

While this is originally stated as a compression result, we can read it backwards to see that a $d$-dimensional space has the capacity to represent $N = e^{O(d \epsilon^2)}$ points with geometry intact. Note that this is exponential in $d$.
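A dense Gaussian random matrix is one standard JL construction, and it is short enough to try directly. A minimal sketch in plain Python (the sizes and helper names are my choices, not part of the lemma):

```python
import math
import random

rng = random.Random(42)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def random_projection(points, d):
    # rows of Gaussians scaled by 1/sqrt(d) preserve norms in expectation
    n = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(d) for _ in range(n)] for _ in range(d)]
    return [[sum(r[i] * p[i] for i in range(n)) for r in R] for p in points]

N, n, d = 20, 500, 200
points = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(N)]
proj = random_projection(points, d)

ratios = [dist(proj[i], proj[j]) / dist(points[i], points[j])
          for i in range(N) for j in range(i + 1, N)]
# all 190 pairwise distance ratios stay close to 1
# despite the 2.5x compression
print(min(ratios), max(ratios))
```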
This is not just a theoretical curiosity. Embedding models like word2vec [2] or sentence transformers [3] pack millions of concepts into ~768 dimensions. The JL capacity result tells us why this works: $e^{O(d\epsilon^2)}$ with $d = 768$ is an absurdly large number of near-orthogonal directions. Unrelated words get mapped to nearly orthogonal vectors and don’t interfere with each other. Analogy arithmetic (king - man + woman = queen) works because directions in this space are meaningful and there is room for them not to collide. Random projections exploit this directly too: you can project high-dimensional data down to $O(\log N)$ dimensions via a random matrix and preserve distances, which is the basis of locality-sensitive hashing and streaming sketches.
## Where next
The curse of dimensionality tells us that data gets sparse and distances lose meaning as dimensions grow. The boon is the flip side: exponential capacity, near-orthogonality for free, and distances that survive projection. Same geometry, two readings. ML lives in the gap since real data is high-dimensional but structured (it sits on a manifold), and the boon wins over the curse.
There are two directions I want to explore from here. First is Cover’s theorem (1965), which says that data not linearly separable in a low-dimensional space becomes separable with high probability when mapped to a higher-dimensional one. This is why kernel methods and neural network hidden layers work: they buy room. Second is superposition in neural networks, the idea that a network with $n$ neurons can represent far more than $n$ features by packing them into near-orthogonal directions, which connects directly to the geometry above. But those are for another post.
## Footnotes

1. W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” *Contemporary Mathematics*, 26, 1984.
2. T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, 2013.
3. N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv:1908.10084, 2019.