What Makes a Good Latent Space? The Welch Bound and the Simplex

June 2, 2026 · 14 min read


Part 5 of 8 · Geometry of Representations


Runnable JAX companion: Auditing Latent Space Geometry in JAX. Prefer to read the code? This post has a hands-on JAX / Flax NNX implementation.

How many meanings can you pack into a fixed number of dimensions before they start talking over each other? That is the real story of a good latent space: not “features” in the abstract, but a crowded communication channel.

In many normalized embedding systems (classifier heads, contrastive encoders, retrieval models) every class, image, caption, or document is compared through a direction. At inference time, the model listens to those directions and decides which ones sound alike. If two directions are too correlated, they interfere. If every direction collapses into the same corner, nothing can be told apart. If the directions waste dimensions, the space becomes fragile.

A latent space is a codebook. (Norms and continuous semantics matter too; the codebook lens is the part that the geometry below explains.)

And the packing question has a floor. Lloyd Welch proved it in 1974 for signal codes, long before anyone was training contrastive encoders. Several corners of modern representation learning keep rediscovering closely related geometry: when there is room, class means arrange as a regular simplex; when there is not, the best codebook presses against the Welch bound.

This post builds that geometry from the failure mode up.

The mystery appears immediately

What does the failure actually look like? Below, a tiny encoder trains live in your browser. Each colour is a class; each dot is an example; the larger rings are class centroids. Press Play.

With repulsion on, the classes tighten and spread into a clean codebook. Toggle repulsion off, re-roll, and run it again. The same encoder now takes the cheap solution: every class lands on the same point.

That pile-up is collapse. It is the first villain of representation learning.

The interesting part is not merely that repulsion prevents collapse. The interesting part is the shape it chooses. Why do the centroids settle at those angles? Why not push every class to the opposite side of every other class? Why does the good solution look so rigid?

To answer that, we need three constraints. A good latent space must be tight within each class, separated between classes, and full-rank. The drama is that these constraints do not automatically agree.

Constraint 1: collapse what should match

Neural networks do not begin with an evenly spread sphere. Push data through a random, untrained ReLU network, normalize the outputs, and the points bunch into a cone. Add depth and the cone gets narrower.

Training has to pull examples of the same class together. This is the first pressure: alignment.

Alignment is necessary, but it is not enough. If all the loss says is “matching things should be close,” then every class can satisfy it by becoming the same point. Collapse is not a weird bug. It is the natural endpoint of alignment without a counterforce.

Constraint 2: leave a margin

The counterforce is separation. But separation is not a decorative choice. It buys robustness.

Below, two noisy classes sit at an angle you control. Before dragging the slider, make a bet: at what angle does the same noise stop being dangerous?

The larger the angle, the more noise a point can absorb before crossing the boundary. In the codebook view, margin is low crosstalk: one class’s signal should not be easily confused for another’s.
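One way to put a number on that bet, as a sketch under simplifying assumptions that the demo itself does not specify: two unit-norm prototypes, a nearest-prototype decision rule, and noise measured in Euclidean distance. The decision boundary is then the hyperplane bisecting the two prototypes, and each prototype sits at distance sin(θ/2) from it. The helper name `margin` is hypothetical.

```python
import math

def margin(theta_deg: float) -> float:
    """Distance from a unit prototype to the bisecting boundary
    between two unit prototypes separated by theta_deg degrees."""
    return math.sin(math.radians(theta_deg) / 2.0)

for theta in (30, 60, 90, 120, 180):
    print(f"theta = {theta:3d} deg -> margin = {margin(theta):.3f}")
```

Under these assumptions the margin keeps growing all the way to 180 degrees, which is what makes the pairwise view so tempting.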

So the naive goal seems obvious: maximize every pairwise angle.

That is where the story gets interesting, because the obvious goal is wrong.

Constraint 3: do not waste dimensions

An embedding can look separated while secretly living in too few dimensions. If a 100-dimensional latent space collapses onto a line, 99 dimensions have become dead capacity. The codebook may have large distances along that line, but it has lost room to express independent distinctions.

The measurement to watch is effective rank: how many directions the cloud actually uses.
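Effective rank has several definitions. The sketch below uses the participation ratio, (tr S)² / tr(S²) for the covariance S, which is one common proxy and an assumption here rather than the metric behind the post's demos. It needs only traces, so it runs in plain Python; the function name `participation_ratio` and the toy clouds are illustrative.

```python
import random

def participation_ratio(points):
    """(tr S)^2 / tr(S^2) for the covariance S of `points`.
    Equals k when variance is spread evenly over k directions."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    # Covariance matrix S (d x d).
    S = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
         for a in range(d)]
    trace = sum(S[a][a] for a in range(d))
    # tr(S^2) is the sum of squared entries because S is symmetric.
    trace_sq = sum(S[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / trace_sq

rng = random.Random(0)
# A 5-D cloud that secretly lives on one line: every point is t * direction.
line = [[t * w for w in (1.0, 2.0, -1.0, 0.5, 3.0)]
        for t in (rng.gauss(0, 1) for _ in range(200))]
# A 5-D cloud with independent unit-variance coordinates.
ball = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(2000)]

print(participation_ratio(line))  # 1 up to rounding: one direction in use
print(participation_ratio(ball))  # close to 5: all directions in use
```

The line-shaped cloud can have arbitrarily large pairwise distances and still score 1, which is the "separated but rank-collapsed" case described above.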

Now the conflict is visible. Separation wants large angles. Rank wants independent directions. Those are not the same thing.

The false answer: make everything opposite

What could be more different than opposite? On the cosine scale, -1 looks like “maximally different.” But two antiparallel vectors span one line. They are the same direction with a sign flip. This is the cosine-similarity bug from Opposite Is Not Different, returning now as a question about the whole codebook rather than a single pair.

Before you drag the green arrow, guess which configuration is more useful for a representation: 180 degrees apart, or 90 degrees apart?

[Interactive figure: two draggable unit vectors u and v, with readouts for θ, cos θ, cos²θ, 1 − cos²θ, and dim span(u, v).]

Two unit vectors: drag the green tip. Antiparallel (cos = -1): maximally opposite, but the pair collapses onto a 1-D line. Orthogonal (cos = 0): independent, spanning the full plane. Watch the span readout drop from 2 to 1 at opposition. Pairwise distance alone can reward a rank collapse.

Opposition wins a single pairwise contest by sacrificing a dimension. Orthogonality uses the space. For exactly two centered class prototypes, opposition is the simplex solution (cosine -1), but a representation that must also preserve other factors of variation is still impoverished if everything collapses onto one line. For three or more classes, blindly chasing antipodes becomes impossible and, worse, conceptually misleading.
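The quantity 1 − cos²θ has a direct linear-algebra reading: for two unit vectors it is the determinant of their 2×2 Gram matrix, which is zero exactly when the pair spans a single line. A small sketch (the function name `gram_det` is illustrative):

```python
import math

def gram_det(theta_deg: float) -> float:
    """Determinant of the Gram matrix [[1, c], [c, 1]] of two unit
    vectors at angle theta_deg, i.e. 1 - cos^2(theta)."""
    c = math.cos(math.radians(theta_deg))
    return 1.0 - c * c

for theta in (0, 60, 90, 120, 180):
    c = math.cos(math.radians(theta))
    print(f"theta = {theta:3d}  cos = {c:+.2f}  1 - cos^2 = {gram_det(theta):.2f}")
```

The determinant peaks at 90 degrees and vanishes at both 0 and 180: by this measure, opposite is as degenerate as identical.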

The correct question is not “how do I make every pair as opposite as possible?” It is:

How do I make all pairs as evenly separated as possible while keeping the whole codebook full-rank?

That question has a beautiful answer.

If the classes fit, what shape wins?

When there is room, every pair can get the same deal. For centered class means, room means $d \ge C - 1$: all pairs can then share one common angle. No pair gets special treatment. No dimension is wasted.

The pairwise cosine is

$$-\frac{1}{C-1}.$$

That number is the fingerprint of the regular simplex. Three classes land 120 degrees apart. Four classes form a tetrahedron. As the number of classes grows, the cosine approaches 0: almost orthogonal, not opposite.
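The value follows from one line of algebra. Assume $C$ unit-norm class means $m_1, \dots, m_C$ that are centered ($\sum_i m_i = 0$) and share a common pairwise cosine $c$ (the symbols $m_i$ and $c$ are introduced here for the derivation):

$$0 = \Big\| \sum_{i=1}^{C} m_i \Big\|^2 = \sum_{i} \|m_i\|^2 + \sum_{i \ne j} \langle m_i, m_j \rangle = C + C(C-1)\,c \quad\Longrightarrow\quad c = -\frac{1}{C-1}.$$

Without the centering assumption, the same expansion with $\|\sum_i m_i\|^2 \ge 0$ shows that the average pairwise cosine of any $C$ unit vectors can never drop below $-1/(C-1)$. The simplex is the configuration that reaches that limit on every pair at once.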

[Interactive slider: number of classes n. Example reading: n = 10 gives cos θ ≈ −0.111.]

The simplex angle, cos = -1/(C-1), as a function of the number of classes. C = 2 gives -1, the only case where opposition is optimal. C = 3 gives -0.5 (120 degrees); C = 4 gives -0.33; by C = 50 it is -0.02, essentially orthogonal. The simplex is the maximally even centered arrangement.
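A quick numerical check of those values, as a sketch. One standard construction (not necessarily the one the demo uses) takes the C standard basis vectors in R^C, subtracts their common mean, and normalizes; the resulting points live in a (C − 1)-dimensional subspace. The function names are illustrative.

```python
import math

def simplex_vertices(C: int):
    """C unit vectors in R^C forming a centered regular simplex:
    standard basis vectors minus their common mean, then normalized."""
    verts = []
    for i in range(C):
        v = [(1.0 if k == i else 0.0) - 1.0 / C for k in range(C)]
        norm = math.sqrt(sum(x * x for x in v))
        verts.append([x / norm for x in v])
    return verts

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Columns: C, measured cosine between two vertices, predicted -1/(C-1).
for C in (2, 3, 4, 10, 50):
    V = simplex_vertices(C)
    print(C, round(dot(V[0], V[1]), 4), round(-1 / (C - 1), 4))
```

The measured and predicted columns should agree for every C, and the vertices sum to zero, which is the centering condition.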

This is the first payoff. The “good” centroids from the opening animation were not merely spreading out. They were moving toward the only centered arrangement where every class gets the same deal.

The rotating sphere below shows the same idea in 3-D. With room for orthogonal axes, the vectors can avoid interference entirely. With four centered class means, the tetrahedron appears. Add more vectors than dimensions and the sphere runs out of room.

This is where the radio engineers enter.

If they do not fit, Welch tells you the crosstalk floor

Imagine many users sharing one radio channel. Each user gets a signature vector. If two signatures are correlated, the receiver hears interference. You want as many users as possible, each with as little crosstalk as possible.

That was the signal-processing version of our latent-space problem. In 1974, Lloyd Welch proved a lower bound on the maximum cross-correlation among such signals.

Play with the 2-D version. With two users, the signatures can be orthogonal. Add a third user, and zero interference is impossible.

For $C$ unit vectors in $d$ dimensions, with $C > d$, the worst absolute inner product must obey

$$\max_{i \ne j} |\langle x_i, x_j\rangle| \ge \sqrt{\frac{C-d}{d(C-1)}}.$$

That is the Welch bound. It says: once the codebook is overloaded, some amount of crosstalk is unavoidable. You can move the vectors around, train longer, tune the optimizer, or change the loss temperature. You cannot beat the floor imposed by the dimension.
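The standard argument is short enough to sketch. Let $G$ be the $C \times C$ Gram matrix with entries $G_{ij} = \langle x_i, x_j \rangle$, and write $\mu = \max_{i \ne j} |\langle x_i, x_j \rangle|$ (both symbols are introduced here). $G$ is positive semidefinite with trace $C$ and rank at most $d$, so applying Cauchy-Schwarz to its at most $d$ nonzero eigenvalues gives $\|G\|_F^2 \ge (\operatorname{tr} G)^2 / d$. Then:

$$\frac{C^2}{d} \le \|G\|_F^2 = C + \sum_{i \ne j} |\langle x_i, x_j \rangle|^2 \le C + C(C-1)\,\mu^2 \quad\Longrightarrow\quad \mu^2 \ge \frac{C-d}{d(C-1)}.$$

The first inequality is an equality when the nonzero eigenvalues of $G$ are all equal; the second when every off-diagonal entry has the same magnitude.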

The bound is universal, but equality is special. When an equiangular tight frame exists for that $(C, d)$, it attains the floor: equiangular means the crosstalk is shared evenly instead of dumped onto a few unlucky pairs, and tight means the vectors use the whole space evenly, exactly the compromise the three constraints demanded. Such frames do not exist for every $(C, d)$; when none does, the best packing sits strictly above the floor: the bound still holds, but it is no longer attained.
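The smallest overloaded case makes this concrete. As an illustrative sketch (the names `welch_bound`, `mercedes`, `max_crosstalk`, and `frame_operator` are not from the post), three unit vectors at 120-degree spacing in the plane form an equiangular tight frame for $(C, d) = (3, 2)$, and their worst-case crosstalk sits on the Welch floor:

```python
import math

def welch_bound(C: int, d: int) -> float:
    """Welch lower bound on max |<x_i, x_j>| for C unit vectors in R^d (C >= d)."""
    return math.sqrt((C - d) / (d * (C - 1)))

# Three unit vectors in the plane, 120 degrees apart.
mercedes = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
            for k in range(3)]

def max_crosstalk(vectors):
    """Largest absolute inner product over distinct pairs."""
    return max(abs(sum(a * b for a, b in zip(u, v)))
               for i, u in enumerate(vectors)
               for j, v in enumerate(vectors) if i < j)

def frame_operator(vectors):
    """S = sum_i x_i x_i^T; a tight frame has S = (C/d) * identity."""
    d = len(vectors[0])
    return [[sum(x[a] * x[b] for x in vectors) for b in range(d)]
            for a in range(d)]

print(welch_bound(3, 2))          # 0.5
print(max_crosstalk(mercedes))    # 0.5 up to floating-point rounding
print(frame_operator(mercedes))   # approximately [[1.5, 0], [0, 1.5]]
```

Both halves of the name show up in the output: every pair has the same crosstalk (equiangular), and the frame operator is a multiple of the identity (tight).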

Now the post’s title should feel less like jargon. A good latent space is a good overloaded codebook.

Why descent can actually reach the good codebook

There is still a practical worry. Maybe the best arrangement exists, but gradient descent almost never finds it.

The frame-potential story explains why that worry is smaller than it looks. Benedetto and Fickus introduced the frame potential as an energy measuring how far a set of unit vectors is from being evenly spread, and showed that when there are at least as many vectors as dimensions, every local minimizer is a global one: a tight frame. In plain language: if you descend the right energy, the bad traps are not the main obstacle. Two caveats apply. The minimizers are tight but not necessarily equiangular, so this energy alone does not guarantee the Welch floor. And this is a statement about the idealized geometry (free unit vectors descending the frame potential), not about a full neural network’s loss surface: the codebook geometry itself is not the source of the trap, but the trained network may still have bad minima of its own.

Watch six independent random starts descend together.

The starts disagree. The floor does not. That is why the geometry is more than an existence theorem; it is an attractor.
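A toy version of that experiment, as a sketch: six random starts of $C = 3$ unit vectors in $d = 2$, each descending the frame potential $\sum_{i,j} \langle x_i, x_j \rangle^2$ by gradient steps projected back onto the unit circle. The step size, step count, and seeds are arbitrary choices, and this covers free unit vectors only, not a network. For a tight frame the potential equals $C^2/d$, which is the floor being tested here.

```python
import math
import random

def frame_potential(X):
    """FP(X) = sum over all ordered pairs (i, j) of <x_i, x_j>^2."""
    return sum(sum(a * b for a, b in zip(u, v)) ** 2 for u in X for v in X)

def descend(X, lr=0.02, steps=2000):
    """Projected gradient descent on the frame potential over unit vectors."""
    d = len(X[0])
    for _ in range(steps):
        # Frame operator S = sum_j x_j x_j^T; the gradient w.r.t. x_i is 4 S x_i.
        S = [[sum(x[a] * x[b] for x in X) for b in range(d)] for a in range(d)]
        new_X = []
        for x in X:
            g = [4.0 * sum(S[a][b] * x[b] for b in range(d)) for a in range(d)]
            radial = sum(ga * xa for ga, xa in zip(g, x))
            # Remove the radial part, step, then renormalize onto the sphere.
            y = [xa - lr * (ga - radial * xa) for xa, ga in zip(x, g)]
            norm = math.sqrt(sum(c * c for c in y))
            new_X.append([c / norm for c in y])
        X = new_X
    return X

def random_unit_vectors(C, d, rng):
    X = []
    for _ in range(C):
        v = [rng.gauss(0, 1) for _ in range(d)]
        n = math.sqrt(sum(c * c for c in v))
        X.append([c / n for c in v])
    return X

C, d = 3, 2
finals = []
for seed in range(6):
    X0 = random_unit_vectors(C, d, random.Random(seed))
    X = descend(X0)
    finals.append(frame_potential(X))
    print(f"seed {seed}: start {frame_potential(X0):.4f} -> end {finals[-1]:.4f}"
          f"  (floor {C * C / d:.4f})")
```

The final vectors differ from seed to seed (any rotation of a tight frame is still tight), but the final energy should not.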

Neural networks rediscover the same shape

What happens to a classifier trained long past zero training error? The natural guess is drift into ever more idiosyncratic structure. When Papyan, Han, and Donoho looked in 2020, they found the opposite. The late phase simplified:

examples from the same class collapsed toward their class mean;
class means converged toward a simplex equiangular tight frame;
classifier weights aligned with those means;
prediction approached nearest-class-mean classification.

They called this neural collapse.

Read through the codebook story, neural collapse is not a strange coincidence. The network has learned a low-crosstalk codebook for the classes. Within-class variation vanishes; class means arrange as evenly as the dimension allows; the classifier becomes a decoder for that codebook.
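A minimal diagnostic sketch for the class-mean property, under stated assumptions: given per-example features and integer labels, compute class means, subtract the mean of those class means (equal to the global feature mean when classes are balanced), and compare pairwise cosines with $-1/(C-1)$. The toy data is constructed so that its class means already sit on a simplex, so it only shows what the check reads at the ideal. The function name is illustrative, and this is not the metric code from Papyan, Han, and Donoho.

```python
import math
from collections import defaultdict

def class_mean_cosines(features, labels):
    """Pairwise cosines between centered class means."""
    groups = defaultdict(list)
    for f, y in zip(features, labels):
        groups[y].append(f)
    d = len(features[0])
    means = {y: [sum(f[k] for f in fs) / len(fs) for k in range(d)]
             for y, fs in groups.items()}
    # Center on the mean of class means (the global mean if classes are balanced).
    center = [sum(m[k] for m in means.values()) / len(means) for k in range(d)]
    centered = {y: [m[k] - center[k] for k in range(d)] for y, m in means.items()}
    cosines = {}
    classes = sorted(centered)
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            u, v = centered[a], centered[b]
            dot = sum(p * q for p, q in zip(u, v))
            cosines[(a, b)] = dot / (math.hypot(*u) * math.hypot(*v))
    return cosines

# Toy features: 3 classes in 2-D, class means 120 degrees apart around an
# arbitrary offset, with two examples placed symmetrically per class.
offset = (5.0, -2.0)
features, labels = [], []
for c in range(3):
    ang = 2 * math.pi * c / 3
    mx, my = offset[0] + math.cos(ang), offset[1] + math.sin(ang)
    for eps in (-0.01, 0.01):
        features.append((mx + eps, my - eps))
        labels.append(c)

for pair, cos in class_mean_cosines(features, labels).items():
    print(pair, round(cos, 4))   # each close to -1/(C-1) = -0.5
```

On real features, the interesting readout is how far these cosines are from $-1/(C-1)$ and how much they vary from pair to pair.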

The surprise is not that neural networks found geometry. The surprise is that the geometry is so simple.

Contrastive learning is codebook design with two forces

And what about training without labels? Contrastive learning has no class means to collapse, yet it faces the same crowded channel. Wang and Isola gave it a clean two-force description:

alignment: matching views should be close;
uniformity: normalized features should spread over the hypersphere.

That is the same plot with different names. Alignment tightens each class or positive pair. Uniformity prevents collapse and lowers crosstalk by using the sphere.
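Those two forces can be written down directly. The sketch below follows the alignment and uniformity losses as commonly stated from Wang and Isola (2020), with exponents $\alpha = 2$ and $t = 2$ taken as assumptions: alignment is the mean squared distance between positive pairs, and uniformity is the log of the mean Gaussian potential over distinct pairs. The toy embeddings and helper names are made up for illustration.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs, alpha=2):
    """Mean of ||f(x) - f(x+)||^alpha over positive pairs (lower is better)."""
    return sum(sq_dist(u, v) ** (alpha / 2) for u, v in pairs) / len(pairs)

def uniformity(points, t=2.0):
    """log of mean exp(-t * ||u - v||^2) over distinct pairs (lower is better)."""
    vals = [math.exp(-t * sq_dist(u, v))
            for i, u in enumerate(points) for v in points[i + 1:]]
    return math.log(sum(vals) / len(vals))

def on_circle(deg):
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

# Collapsed: every embedding is the same point. Perfect alignment, worst uniformity.
collapsed = [on_circle(0)] * 4
# Spread: four embeddings evenly placed around the unit circle.
spread = [on_circle(a) for a in (0, 90, 180, 270)]

print(alignment([(collapsed[0], collapsed[1])]))   # 0.0
print(uniformity(collapsed))                       # 0.0, the maximum possible
print(uniformity(spread))                          # negative: lower is better
```

Collapse scores perfectly on alignment and as badly as possible on uniformity, which is why neither term can be dropped.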

Different contrastive methods build different machinery around those two forces: in-batch negatives, memory queues, temperatures, predictor heads, variance penalties, covariance penalties. Under the machinery is the same geometric demand:

make matching things share a code, and make nonmatching codes interfere as little as the dimension allows.

There is one wrinkle worth keeping. A naive pairwise cosine push treats -1 as the ideal value for every negative pair. That is the false answer from earlier. The good packing emerges because all pairs cannot be antipodal at once; the impossible pairwise request gets bent by the global constraint into a simplex or a Welch-bound frame. The geometry of the whole codebook matters more than the target for any one pair.

The answer

A good latent space is not merely “well separated.” It is a codebook satisfying three constraints at once:

alignment: examples that should match collapse toward the same code;
margin: different codes have enough angular room to resist noise;
isotropy (tightness): the codes use the available dimensions evenly, instead of hiding on a line, plane, or cone.

When centered class means fit, the solution is the regular simplex, with pairwise cosine $-1/(C-1)$. When too many codes share too few dimensions, the Welch bound sets the unavoidable crosstalk floor; an equiangular tight frame attains it where one exists, and otherwise the best packing sits strictly above it.

The story is compelling because it is not just an ML story. Radio engineers ran into the same geometry while designing signatures for crowded channels. Frame theorists proved what the even configurations minimize. Classifiers rediscover the simplex during neural collapse. Contrastive learning balances alignment and uniformity to chase the same sphere-packing compromise.

The good latent space is the arrangement where the codes stop shouting over each other.

Cite as

Bouhsine, T. (2026, June 2). What Makes a Good Latent Space? The Welch Bound and the Simplex. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/welch-bound-good-latent-space/

BibTeX

@misc{bouhsine2026welchboundgoodlatentspace,
  author       = {Bouhsine, Taha},
  title        = {What Makes a Good Latent Space? The Welch Bound and the Simplex},
  year         = {2026},
  month        = {jun},
  howpublished = {\url{https://tahabouhsine.com/blog/welch-bound-good-latent-space/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

References

Tammes, P. M. L. (1930). On the Origin of Number and Arrangement of the Places of Exit on the Surface of Pollen Grains. Recueil des Travaux Botaniques Néerlandais 27, 1–84.
Welch, L. R. (1974). Lower Bounds on the Maximum Cross Correlation of Signals. IEEE Transactions on Information Theory 20(3), 397–399. doi:10.1109/TIT.1974.1055219
Conway, J. H., Hardin, R. H., Sloane, N. J. A. (1996). Packing Lines, Planes, etc.: Packings in Grassmannian Spaces. Experimental Mathematics 5(2), 139–159.
Benedetto, J. J., Fickus, M. (2003). Finite Normalized Tight Frames. Advances in Computational Mathematics 18(2–4), 357–385.
Strohmer, T., Heath, R. W. (2003). Grassmannian Frames with Applications to Coding and Communication. Applied and Computational Harmonic Analysis 14(3), 257–275.
Sustik, M. A., Tropp, J. A., Dhillon, I. S., Heath, R. W. (2007). On the Existence of Equiangular Tight Frames. Linear Algebra and its Applications 426(2–3), 619–635.
Papyan, V., Han, X. Y., Donoho, D. L. (2020). Prevalence of Neural Collapse During the Terminal Phase of Deep Learning Training. Proceedings of the National Academy of Sciences 117(40), 24652–24663. doi:10.1073/pnas.2015509117
Wang, T., Isola, P. (2020). Understanding Contrastive Representation Learning Through Alignment and Uniformity on the Hypersphere. ICML 2020 (PMLR 119), 9929–9939. arXiv:2005.10242
