Untangling the Moons: A Visual History of Contrastive Learning

May 26, 2026 · 27 min read

#ml #contrastive #embeddings #kernels #interpretability #contrastive-learning #infonce #simclr #clip #siglip #supcon #triplet-loss #self-supervised-learning #representation-learning

Twenty years of contrastive learning, eight losses, eight datasets — and the question of when a loss should stop pushing.

Contrastive learning is the standard recipe for turning raw data into a usable embedding space, and the recipe has been rewritten roughly every three years. As a framing device — and it is a framing device, the real history is messier — each one reads as a response to the previous one’s failure mode: triplet relativised pair contrastive’s brittle global margin; InfoNCE replaced triplet’s fragile sampling with a softmax (though it arrived from predictive coding and mutual-information estimation, not as a triplet patch); CLIP scaled InfoNCE to multimodal data; SupCon generalised it to many positives; alignment-and-uniformity decomposed what it was doing; and SigLIP was the first to question whether the negatives target was right at all.

What most of the lineage glosses over is a question of geometry. Most of these losses push different-class pairs as far apart as possible, and on the unit sphere the per-pair gradient points at cosine $-1$: diametrically opposite. But the maximally discriminative arrangement of $k$ classes on the sphere is not all-pairs-at-$-1$ (which is impossible for $k > 2$); it is the regular simplex, with every inter-class cosine equal to $-1/(k-1)$. That target is $-1$ only for two classes and rushes toward orthogonality as $k$ grows: $-\tfrac12$ at three classes, $-\tfrac19 \approx -0.11$ at ten, essentially $0$ by ImageNet scale. So for any large-vocabulary problem the right target is approximately orthogonality, and a loss whose gradient keeps hauling negatives toward $-1$ is spending effort on separation the geometry neither needs nor can deliver. I made the precise case, including why $k = 2$ is the one case where opposition is genuinely correct, in Opposite Is Not Different. Here the demonstration is mechanical: in 2D, with the points being the embeddings, you can watch where each loss’s path goes and where it stops.
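The $-1/(k-1)$ figure is easy to check numerically. The following is a minimal sketch, assuming NumPy and using centred one-hot vectors as one convenient construction of the regular simplex; the helper names are illustrative and do not come from the explorer code.

```python
import numpy as np

def simplex_vertices(k: int) -> np.ndarray:
    """Return k unit vectors in R^k that form a regular simplex."""
    centred = np.eye(k) - 1.0 / k          # one-hot vectors with their centroid subtracted
    return centred / np.linalg.norm(centred, axis=1, keepdims=True)

def inter_class_cosines(k: int) -> np.ndarray:
    """All off-diagonal cosines between the k simplex vertices."""
    v = simplex_vertices(k)
    gram = v @ v.T
    return gram[~np.eye(k, dtype=bool)]

for k in (2, 3, 10, 1000):
    cosines = inter_class_cosines(k)
    # Every pair shares the same cosine, and it matches -1/(k-1).
    print(k, float(cosines.min()), float(cosines.max()), -1.0 / (k - 1))
```

Every pair of vertices shares one cosine, and that cosine shrinks toward zero as $k$ grows, which is the sense in which the simplex target approaches orthogonality.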

A caveat worth stating up front, because the 2D demos make it vivid in a slightly misleading way. With two classes — the two-moons case most of these panels use — the simplex optimum is $-1$ , so there is nothing to overshoot; opposition is correct. The geometric argument bites for many classes, where it can only be shown in dimensions too high to animate. What the 2D panels show faithfully is the dynamics — which losses freeze, which never stop, which flatten at a sensible place — and those dynamics are the durable lesson here.

The fastest way to feel any of this is to skip the encoder entirely. Take 60 points in 2D, label them by class, and treat the positions themselves as the embeddings. Run gradient descent on the loss of your choice. Eight losses, eight geometries, eight failure modes — each visible in under a minute.

This post walks the lineage by idea, not strictly by date (CLIP and SupCon are contemporaneous — 2020–21 — and ordered here by the conceptual thread, not their arXiv stamps). Each loss gets an interactive explorer locked to that loss; the default preset starts from random positions and random labels so the loss has to impose all the geometry itself, and the rest of the presets surface the loss’s named pathologies on demand. At the end, all eight race on the same dataset, from the same initial points, at the same step counter.

The setup

The left panel in every explorer is the embedding space; the right is the loss curve and a running nearest-centroid classification accuracy. Losses that use cosine similarity project to the unit circle every step — the dashed reference circle appears for those. Losses that use Euclidean distance leave the points free.

Eight datasets are available in the dropdown:

random (default) — uniform positions, random labels, no spatial signal. The loss must impose everything.
random (4 classes) — same, four-way.
two-moons. The classic toy benchmark: two interlocking half-circles.
overlapping blobs — heavy class overlap.
concentric rings — same angular distribution, different radii. Diagnoses cosine-based losses.
four classes / eight classes — multi-class stress.
two spirals, imbalanced moons, noisy moons — adversarial settings.

The default preset uses random data for every explorer because pre-organised datasets give the loss too much of a head start. What you want to see is the law — the configuration each loss converges to — and a structured initial state lets the loss get away with doing very little.

1. The original: Hadsell, Chopra, and LeCun, 2006

The modern contrastive era opens with Hadsell, Chopra, and LeCun’s Dimensionality Reduction by Learning an Invariant Mapping: a siamese network, carried over from the same group’s 2005 face-verification work, trained with what is now the canonical pair loss. For every pair of points you know whether they share a label. Pull positives together with a quadratic penalty; push negatives apart until they reach a margin $m$, and then stop:

$$\mathcal{L}_{\text{pair}}(i, j) \;=\; \begin{cases} \|z_i - z_j\|^2 & y_i = y_j \\ \bigl[\,m - \|z_i - z_j\|\,\bigr]_+^2 & y_i \neq y_j \end{cases}$$

The margin does all the work. Without it the repulsion has no scale and the embedding blows up. With it the loss is satisfied as soon as negatives are far enough — and then the gradient vanishes. Pair contrastive knows when to stop, which is both its virtue and its problem.
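The encoder-free experiment can be sketched in a few lines. This is a minimal version assuming NumPy: the 2D positions are the parameters, and plain gradient descent runs on the pair loss above. The function name, the margin, and the step size are illustrative assumptions, not the explorer’s actual settings.

```python
import numpy as np

def pair_loss_and_grad(z, y, margin=1.0):
    """Mean pair-contrastive loss over all ordered pairs i != j, plus its gradient in z."""
    n = len(z)
    diff = z[:, None, :] - z[None, :, :]          # (n, n, d): z_i - z_j
    dist = np.linalg.norm(diff, axis=-1)          # (n, n) Euclidean distances
    same = y[:, None] == y[None, :]
    pos = same & ~np.eye(n, dtype=bool)           # same label, i != j
    neg = ~same                                   # different label
    slack = np.maximum(margin - dist, 0.0)        # zero once a negative pair clears the margin
    n_pairs = n * (n - 1)
    loss = (np.sum(dist[pos] ** 2) + np.sum(slack[neg] ** 2)) / n_pairs
    # Per-pair derivative w.r.t. z_i is coef_ij * (z_i - z_j):
    #   positives: 2;  negatives: -2 * slack / dist (zero outside the margin).
    safe_dist = np.where(dist > 0, dist, 1.0)     # avoid 0/0 for coincident points
    coef = np.where(pos, 2.0, 0.0) + np.where(neg, -2.0 * slack / safe_dist, 0.0)
    # Each unordered pair appears twice in the sum, as (i, j) and (j, i), hence the factor 2.
    grad = 2.0 * np.sum(coef[:, :, None] * diff, axis=1) / n_pairs
    return float(loss), grad

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(60, 2))          # 60 random 2D points: the "embeddings"
y = rng.integers(0, 2, size=60)                   # random binary labels
losses = []
for step in range(300):
    loss, grad = pair_loss_and_grad(z, y)
    losses.append(loss)
    z = z - 5.0 * grad                            # plain gradient descent on the positions
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because the hinge zeroes the negative term at distance $m$, the gradient on a negative pair is exactly zero once it clears the margin, which is the freezing behaviour described above.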

Pair contrastive (Hadsell, Chopra, LeCun, 2006). Default preset starts from random data and watches the loss impose its geometry from scratch. Other presets surface the failure modes: too-small margin, too-large margin, imbalanced classes.

The named failure modes:

Margin too small. No reward for separating classes far. Loss drops to zero almost immediately; accuracy stays near chance. The loss curve looks healthy and the embedding is useless. The literature’s name for this — trivial solution — already appears in Hadsell’s paper.
Margin too large. Repulsion does not saturate at any useful scale. Negatives keep getting pushed until they sit a full margin apart, so the embedding norm grows with $m$ rather than with the data, and the loss curve takes a long time to stabilise. In a real network this is where you get gradient explosions.
Class imbalance. Every pair contributes equally to the gradient. With an 80/20 split the majority class dominates and the minority gets dragged. The standard fix is pair re-weighting; vanilla pair contrastive does not do it.

By 2014 the problem with pair contrastive was visible in production face-recognition systems: $m$ is a global hyperparameter that wants to be different for different parts of the data manifold. The fix would be to make the comparison relative — to ask not is this pair close enough? but is the positive closer than the negative, by how much? That is the next loss.

2. The relativisation: FaceNet, 2015

Schroff, Kalenichenko, and Philbin’s FaceNet reformulated the problem in relative terms. Instead of asking two separate questions about positives and negatives, ask one combined question: is the negative farther than the positive, by at least margin $m$ ?

$$\mathcal{L}_{\text{trip}}(a, p, n) \;=\; \bigl[\,\|z_a - z_p\|^2 - \|z_a - z_n\|^2 + m\,\bigr]_+$$

Only triplets that violate the inequality contribute gradient. The hope was that this makes training focus naturally on the hardest examples, automatically. The reality, which the FaceNet paper itself discovered and the next decade of work spent fighting, is that most random triplets are easy — and a triplet that is easy contributes no gradient at all. In a batch of $B$ examples there are $O(B^3)$ candidate triplets and almost all of them are silent.
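How silent the triplets are can be counted directly. The sketch below assumes NumPy and two hypothetical Gaussian-blob datasets (the blob parameters and helper names are made up for illustration). It enumerates every valid (anchor, positive, negative) triplet and reports the fraction whose hinge is non-zero.

```python
import numpy as np

def active_triplet_fraction(z, y, margin=0.2):
    """Fraction of valid (anchor, positive, negative) triplets with a non-zero hinge."""
    n = len(z)
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)   # squared distances
    same = y[:, None] == y[None, :]
    pos = same & ~np.eye(n, dtype=bool)                          # (a, p) with p != a
    neg = ~same                                                  # (a, n)
    valid = pos[:, :, None] & neg[:, None, :]                    # indexed as (a, p, n)
    hinge = d2[:, :, None] - d2[:, None, :] + margin             # d(a,p)^2 - d(a,n)^2 + m
    return float(np.mean(hinge[valid] > 0))

def two_blobs(rng, centre_gap, spread, n_per_class=30):
    """Two Gaussian blobs on the x-axis, centre_gap apart (illustrative data only)."""
    centres = np.array([[-centre_gap / 2, 0.0], [centre_gap / 2, 0.0]])
    z = np.concatenate([rng.normal(c, spread, size=(n_per_class, 2)) for c in centres])
    y = np.repeat([0, 1], n_per_class)
    return z, y

rng = np.random.default_rng(0)
overlapping = active_triplet_fraction(*two_blobs(rng, centre_gap=1.0, spread=1.0))
separated = active_triplet_fraction(*two_blobs(rng, centre_gap=10.0, spread=0.3))
print(f"active triplets, overlapping blobs: {overlapping:.3f}")
print(f"active triplets, separated blobs:   {separated:.3f}")
```

Once the classes are even roughly separated, almost no triplet carries gradient, which is the condition that mining schemes were built to work around.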

Triplet loss (Schroff, Kalenichenko, Philbin, 2015). The 'easy triplets' preset shows the silent-death failure mode that motivated a decade of hard-negative-mining literature.

The named failure modes:

Easy triplets. At initialisation, random triplets are usually satisfied trivially. The loss is near zero from step one and no learning happens. FaceNet’s own response was semi-hard mining: pick negatives that are farther from the anchor than the positive but still inside the margin, so the triplet stays active without being dominated by the very hardest outliers. By 2017 there were online miners, batch-hard miners, miner schedules, and an industry of triplet-mining infrastructure. The point of all of it was to compensate for the fact that the triplet loss, left alone, has no gradient on most of its input.
Aggressive margin. Set the margin bigger than the typical inter-class distance and no triplet is ever satisfied. The embedding wanders stochastically.
Class imbalance. Random anchor sampling means most anchors come from the majority. Balanced batch sampling required.

Pair contrastive and triplet together define what the literature later named the margin family — losses whose gradient vanishes once a margin condition is satisfied. The margin family’s appeal is that it knows when to stop. Its problem is that it stops too soon, and selecting which examples it should think about is its own engineering project. By 2018 the field was ready to give up on margins entirely.

3. The softmax turn: van den Oord and SimCLR, 2018–2020

Van den Oord, Li, and Vinyals, working on contrastive predictive coding, replaced the margin with a softmax. For an anchor $a$ with one positive $p$ and a batch of $N$ negatives:

$$\mathcal{L}_{\text{InfoNCE}}(a) \;=\; -\log \frac{\exp(\mathrm{sim}(z_a, z_p) / \tau)}{\sum_{k \neq a} \exp(\mathrm{sim}(z_a, z_k) / \tau)}$$

Three structural changes from triplet. All negatives at once: every other point participates as a negative, weighted by its similarity to the anchor; hard negatives get most of the gradient automatically, eliminating the explicit miner. No hard margin: the softmax has support everywhere, so the gradient never vanishes — there is always an incentive to spread the negatives a bit wider. Embeddings on a sphere: cosine similarity ignores the norm.

The 2020 wave (SimCLR, MoCo, CLIP) all run on variants of this loss. The name InfoNCE comes from the loss giving a variational lower bound on mutual information, $I(X; Y) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$, though the bound is capped at $\log N$ and so is loose at high MI (Poole et al., 2019); the name NT-Xent (normalised temperature-scaled cross-entropy) comes from the SimCLR paper. Up to implementation details, they are the same loss.
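The $\log N$ cap can be seen on a toy batch. This is a minimal NumPy sketch of the formula above, with cosine similarity and the anchor excluded from the denominator. The function name, the toy data, and the choice to take $N$ as the number of candidates in the denominator are assumptions made for illustration.

```python
import numpy as np

def info_nce(z, pos_index, tau=0.1):
    """Mean InfoNCE loss; z is (n, d) and pos_index[a] is the row index of a's positive."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity via unit vectors
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)                    # exclude k = a from the denominator
    log_denominator = np.logaddexp.reduce(logits, axis=1)
    log_numerator = logits[np.arange(len(z)), pos_index]
    return float(np.mean(log_denominator - log_numerator))

# A best-case toy batch: 8 orthogonal directions, each appearing twice (the two "views").
m = 8
views = np.concatenate([np.eye(m), np.eye(m)])
pos_index = (np.arange(2 * m) + m) % (2 * m)             # view i pairs with view i + m

loss = info_nce(views, pos_index)
n_candidates = 2 * m - 1                                 # every row except the anchor itself
mi_estimate = np.log(n_candidates) - loss
print(f"loss = {loss:.6f}, MI estimate = {mi_estimate:.4f}, cap = {np.log(n_candidates):.4f}")
```

Even with positives perfectly matched, the estimate cannot exceed $\log N$, because the loss itself is never negative.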

InfoNCE / NT-Xent (van den Oord et al., 2018; Chen et al., 2020). The temperature τ is the dominant hyperparameter — the failure-mode presets show its two extremes.

The named failure modes are almost all about the temperature $\tau$ :

τ too low. The softmax becomes peaky. Gradients are dominated by the single hardest negative each step, and any noisy step destabilises the embedding. The loss curve becomes saw-toothed. In Wang & Isola’s terms (section 6), this over-emphasises uniformity at the expense of alignment.
τ too high. The softmax flattens; all negatives contribute roughly equally; the gradient is weak. Convergence is glacial.
Concentric rings. Cosine similarity throws away the norm, so two concentric rings — different radii, same angles — become identical after projection. Accuracy at chance. This is not a defect of InfoNCE; it is a defect of cosine, and any loss that lives on the sphere inherits it.
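Both effects in the list above fit in one short sketch, assuming NumPy. The similarity values are hypothetical, chosen only to show how the softmax share of each negative moves with $\tau$; the second half shows two rings becoming indistinguishable after projection to the unit circle.

```python
import numpy as np

def negative_weights(sims, tau):
    """Relative softmax share that each negative receives at temperature tau."""
    logits = np.asarray(sims, dtype=float) / tau
    w = np.exp(logits - logits.max())          # subtract the max for numerical stability
    return w / w.sum()

sims = np.array([0.9, 0.5, 0.0, -0.5])         # hypothetical anchor-to-negative cosines
for tau in (0.05, 0.5, 5.0):
    print(tau, negative_weights(sims, tau).round(3))

# Concentric rings: same angles, different radii.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
inner = np.stack([np.cos(angles), np.sin(angles)], axis=1)
outer = 3.0 * inner

def to_unit_circle(points):
    return points / np.linalg.norm(points, axis=1, keepdims=True)

print(np.allclose(to_unit_circle(inner), to_unit_circle(outer)))
```

At low $\tau$ nearly all the weight sits on the hardest negative; at high $\tau$ the shares are close to uniform. The ring check confirms that the radius, the only thing separating the two classes, is gone after normalisation.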

I argued at length in Attention is Explainable Because it is a Kernel that the softmax over similarities in attention is mathematically a Nadaraya–Watson smoother, the kernel operator. InfoNCE is the same operator, the same softmax, pointed at a different objective: optimise positions so the kernel-weighted distribution of labels around each anchor matches its own label.

By 2019 the softmax family had a different problem. It was designed for self-supervised learning where each anchor has exactly one positive: the augmented view of the same image. When labels exist, every other example with the same label is also a positive — and InfoNCE was throwing away that signal one sample at a time. Before the labelled fix appeared, the loss took a detour through multimodality.

4. Multimodal scaling: CLIP, 2021

Radford et al.’s Learning Transferable Visual Models From Natural Language Supervision — the CLIP paper — was the loudest deployment of InfoNCE the field has seen. CLIP trains an image encoder and a text encoder jointly on $400$ million image-caption pairs, with a single objective: each image’s embedding should be closest, in cosine distance, to its caption’s embedding, out of a batch of $N$ candidates. The loss is symmetrised InfoNCE: image-as-anchor and text-as-anchor, averaged.

$$\mathcal{L}_{\text{CLIP}} \;=\; \tfrac{1}{2}\bigl(\mathcal{L}_{\text{img}\rightarrow\text{txt}} + \mathcal{L}_{\text{txt}\rightarrow\text{img}}\bigr)$$

Each side is an InfoNCE term. The symmetrisation is what makes the trained embedding bidirectional: a query in either modality retrieves nearest neighbours in the other.
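As a rough sketch of the symmetrised objective, assuming NumPy and random stand-in vectors in place of real encoder outputs (the function names are illustrative, and $\tau$ is fixed here rather than learned):

```python
import numpy as np

def log_softmax_rows(logits):
    """Row-wise log-softmax, computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def clip_loss(img, txt, tau=0.07):
    """Symmetrised InfoNCE: row i of img is matched with row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                       # (n, n) image-to-text similarities
    idx = np.arange(len(img))
    img_to_txt = -log_softmax_rows(logits)[idx, idx].mean()      # each image picks its caption
    txt_to_img = -log_softmax_rows(logits.T)[idx, idx].mean()    # each caption picks its image
    return float(0.5 * (img_to_txt + txt_to_img))

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))                       # stand-in "image" embeddings
txt_matched = img + 0.1 * rng.normal(size=(16, 8))   # captions close to their images
txt_shuffled = np.roll(txt_matched, 1, axis=0)       # every caption paired with the wrong image

print(f"matched:  {clip_loss(img, txt_matched):.4f}")
print(f"shuffled: {clip_loss(img, txt_shuffled):.4f}")
```

Transposing the logit matrix is all the symmetrisation amounts to, so swapping the two arguments leaves the loss unchanged.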

CLIP (Radford et al., 2021). In a single-modality 2D demo CLIP is symmetrised InfoNCE — the gradient flows both directions of each anchor↔positive pair. The geometry it converges to is the same as InfoNCE; the gradient signal is doubled per step.

CLIP’s failure modes are InfoNCE’s failure modes, amplified by scale. CLIP doesn’t freeze the temperature: it learns the logit scale as a log-parameterised scalar (initialised at $\tau = 0.07$ and clipped so the logits are never scaled by more than 100). But that only relocates the sensitivity rather than removing it: the learned scale is still the main throttle on how hard negatives keep separating, and there is no target that says stop. The spherical geometry compounds it: the embedding dimension ($512$ in the public ViT-B/32 release) was chosen for matrix-multiplication throughput, not for the simplex bound that ImageNet-scale class counts would need. And the literal compute cost, measured in GPU-years, is staggering, much of it spent on the bookkeeping needed to keep a $32{,}768 \times 32{,}768$ similarity matrix on-device.

The geometric tension I argued in Opposite Is Not Different is at its most consequential here. CLIP’s loss has its per-pair minimum at cosine $-1$, but no space can make all $\binom{N}{2}$ pairs antiparallel once $N > 2$: if $b = -a$ and $c = -a$ then $b = c$, so only one antipodal pair fits per axis. So the gradient pulls toward a target it can never reach for most pairs, and the model approximates the simplex only by averaging over a huge number of negatives. The cost is paid in batch size.

5. Reintroducing supervision: SupCon, 2020

Khosla et al.’s Supervised Contrastive Learning generalised InfoNCE to multiple positives — every other example with the same class label — by averaging the InfoNCE term over the positive set $P(a)$ :

$$\mathcal{L}_{\text{SupCon}}(a) \;=\; \frac{-1}{|P(a)|}\sum_{p \in P(a)} \log \frac{\exp(\mathrm{sim}(z_a, z_p) / \tau)}{\sum_{k \neq a} \exp(\mathrm{sim}(z_a, z_k) / \tau)}$$

The effect is dramatic. Where InfoNCE pulls each anchor toward a positive each step, SupCon pulls it toward the centroid of all positives. Classes collapse to tight clusters on the sphere, much faster than InfoNCE, and the paper’s headline result was beating plain cross-entropy on ImageNet classification accuracy with a two-stage pre-train-then-fine-tune recipe.
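A minimal NumPy sketch of the SupCon formula above, evaluated on the unit circle with four classes. The function name, the toy data, and the choice to skip anchors that have no positive are assumptions made for illustration.

```python
import numpy as np

def supcon_loss(z, y, tau=0.1):
    """Mean SupCon loss; anchors with no same-class partner are skipped."""
    n = len(z)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)                             # exclude k = a
    log_prob = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    pos = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)     # the positive set P(a)
    n_pos = pos.sum(axis=1)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(n_pos, 1)
    return float(per_anchor[n_pos > 0].mean())

def on_circle(angles):
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 15)                                   # four classes, 15 points each
loss_random = supcon_loss(on_circle(rng.uniform(0.0, 2.0 * np.pi, size=60)), y)
loss_collapsed = supcon_loss(on_circle(y * (np.pi / 2.0)), y)     # each class on a single point

print(f"random positions:  {loss_random:.4f}")
print(f"collapsed classes: {loss_collapsed:.4f}")
```

The fully collapsed arrangement scores far lower than random positions. The loss stays above zero there because the other same-class points remain in the denominator, leaving a floor of roughly $\log(|P(a)|)$ per anchor.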

Supervised Contrastive (Khosla et al., 2020). Default preset uses four classes from random data — the four-class case is where SupCon's geometry is sharpest. The 'representation collapse' preset surfaces the cost of that sharpness.

The named failure modes of SupCon are not about under-fitting. They are about being too good a classifier:

Representation collapse. With low $\tau$ and many positives, each class collapses to a single point on the circle. Linearly separable to perfection; but anything you cared about within a class — pose, style, lighting, the things downstream non-classification tasks need — is gone. SupCon is approximately optimal for downstream classification and approximately worst for downstream tasks that need within-class variation. This within-class collapse, and the simplex arrangement the class means settle into, is neural collapse — named by Papyan, Han & Donoho (2020) for cross-entropy and connected to SupCon by Graf et al. (2021). Whether you like it depends on what you want the embedding for.
Imbalanced classes. $|P(a)|$ ranges across classes; the per-anchor average is on a different scale for each class; the softmax denominator is dominated by the majority. Re-weighting or sub-sampling required.
Label noise. SupCon trusts its labels completely. A flipped label becomes a positive pulled toward the wrong centroid, dragging the cluster boundary every step. Robust-SupCon variants exist; vanilla does not handle this.

By 2020 InfoNCE and SupCon between them dominated practical embedding work, and the field had a new conceptual problem. The losses worked, but it was unclear why. The softmax form bundles several things together — what is the right way to think about what these losses are doing in the limit?

6. The theoretical decomposition: Wang and Isola, 2020

Wang and Isola’s Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere proved that InfoNCE, in the limit of infinite negatives, is doing two separable things:

$$\mathcal{L}_{\text{align}} \;=\; \mathbb{E}_{(x,y) \sim p_+}\,\bigl\|z_x - z_y\bigr\|_2^2$$

$$\mathcal{L}_{\text{uniform}} \;=\; \log \mathbb{E}_{x, y \sim p_\text{data}} \exp\!\bigl(-t\,\|z_x - z_y\|_2^2\bigr)$$

with total loss $\mathcal{L}_{\text{align}} + \lambda \mathcal{L}_{\text{uniform}}$. The uniformity term is the logarithm of the average pairwise Gaussian potential; it is minimised by the uniform distribution on the sphere, so driving it down spreads the embeddings out.

The decomposition is more than aesthetic. It says the two jobs are independent. Writing them as separate terms means you can balance them by tuning $\lambda$ , diagnose which one is misbehaving from the loss curve, and — in principle — replace either one with a different functional and retain interpretability.
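The two terms are short enough to write out. This is a minimal NumPy sketch, with $t = 2$ as an assumed default and the expectation in the uniformity term taken over distinct pairs; the helper names and the toy configurations are illustrative.

```python
import numpy as np

def align_loss(x, y):
    """Mean squared distance between positive pairs (row i of x with row i of y)."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniform_loss(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of rows in z."""
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)               # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))

def on_circle(angles):
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

collapsed = on_circle(np.zeros(32))                                    # every point at one spot
spread = on_circle(np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False))  # evenly spaced

print("alignment, identical views:", align_loss(spread, spread))
print("uniformity, collapsed:", uniform_loss(collapsed))
print("uniformity, evenly spread:", uniform_loss(spread))
```

Total collapse is ideal for alignment and the worst case for uniformity, where the term reaches its maximum of $0$; an even spread drives the uniformity term negative. The balance between the two is what $\lambda$ sets.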

Alignment + Uniformity (Wang & Isola, 2020). The λ = 0 preset shows what happens with no uniformity term: representation collapse, all points to one location. The high-λ preset shows the opposite: no alignment, points spread evenly with no class structure.

The failure modes of Alignment + Uniformity are particularly clean because the loss has only one balancing knob:

λ = 0 (alignment only). The loss is minimised by collapsing all points to a single location. Positives are perfectly aligned, but so is everything else.
λ very high (uniformity only). Points spread evenly across the sphere with no regard for class structure. Accuracy at chance.

The cleanness is the contribution. Wang & Isola don’t claim a new SOTA; they claim that this decomposition is what InfoNCE was doing all along, and once it is written explicitly you can see why InfoNCE works (the two objectives are intrinsically separable on the sphere) and predict when it will fail (whenever one of them is dominating).

By 2021 the contrastive lineage looked finished. The full pipeline (pair → triplet → InfoNCE → CLIP → SupCon → align-and-uniform) covered every modality of comparison the literature had needed. The remaining work was supposed to be parameter tuning and architectural search.

What the lineage missed was a question none of its members had asked: should the negatives target really be cosine $-1$ ? Two years later, the answer arrived from inside Google.

7. Questioning the target: SigLIP, 2023

Zhai et al.’s Sigmoid Loss for Language Image Pre-Training — SigLIP — replaces CLIP’s softmax with a pairwise sigmoid:

$$\mathcal{L}_{\text{SigLIP}} \;=\; \sum_{i, j} \log\!\Bigl(1 + \exp\!\bigl(-y_{ij}\,(t \cdot \mathrm{sim}_{ij} + b)\bigr)\Bigr)$$

with $y_{ij} = +1$ for matched pairs (image $i$ with caption $i$) and $-1$ otherwise, $t > 0$ a learned temperature (applied as a multiplier, so it plays the role of $1/\tau$), and $b$ a learned bias. The structural change is twofold. Pairwise, not softmax: each pair $(i, j)$ contributes independently; there is no normalisation over the batch. Learnable bias: the bias $b$ controls where negatives equilibrate. For a mismatched pair the gradient is proportional to $\sigma(t \cdot \mathrm{sim}_{ij} + b)$, so the loss saturates (gradient vanishes) once $t \cdot \mathrm{sim}_{ij} + b \ll 0$, i.e., once $\mathrm{sim}_{ij}$ falls well below $-b/t$. The bias is the loss’s stop sign on the cosine axis.
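The saturation is visible in the per-pair gradient. This is a minimal NumPy sketch of the formula above; the values $t = 10$ and $b = 0$ are illustrative assumptions that put the threshold $-b/t$ at orthogonality, not trained values.

```python
import numpy as np

def siglip_loss(sim, t, b):
    """Sum of pairwise sigmoid losses; sim[i, j] is the cosine of image i and caption j."""
    labels = 2.0 * np.eye(len(sim)) - 1.0                 # +1 on the diagonal, -1 elsewhere
    return float(np.sum(np.logaddexp(0.0, -labels * (t * sim + b))))

def negative_pair_grad(sim, t, b):
    """d(loss)/d(sim) for one mismatched pair (y = -1): t * sigmoid(t * sim + b)."""
    return t / (1.0 + np.exp(-(t * sim + b)))

t, b = 10.0, 0.0                                          # assumed values; threshold -b/t = 0
for cos in (0.5, 0.0, -0.5, -1.0):
    print(f"cos = {cos:+.1f}  push on this negative = {negative_pair_grad(cos, t, b):.5f}")

ideal = 2.0 * np.eye(4) - 1.0                             # matched at +1, mismatched at -1
print("loss, ideal similarities:", siglip_loss(ideal, t, b))
print("loss, all similarities 0:", siglip_loss(np.zeros((4, 4)), t, b))
```

The push on a negative is largest above the threshold, exactly $t/2$ at the threshold, and decays quickly below it. Shifting $b$ moves that threshold along the cosine axis without changing the shape of the curve.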

The implications, for the geometric question the lineage had been ignoring:

SigLIP’s negatives target is configurable, not pinned to $-1$ . The optimiser stops pushing once a negative is far enough — and the threshold is whatever the bias says it is.
Because $t$ and $b$ are learnable, the network can discover a target rather than chase $-1$ indefinitely. The loss saturates well short of opposition — the bias gives the gradient an explicit stopping point, which is the structural difference from the softmax family. (Exactly where a trained bias lands is an empirical question I won’t put a number on here without a run to cite.)
The pairwise structure removes the $N \times N$ softmax denominator. Small batches work; the engineering bottleneck that defined CLIP-era training is gone.

SigLIP (Zhai et al., 2023). The slider is the equilibrium cosine for negatives: $-b/t$, the negated bias divided by the temperature. Default targets orthogonality (cos ≈ 0); the 'target opposition' preset recovers CLIP-style behaviour. The 'shallow target' preset shows the configurability that no other loss has.

The named regimes:

Target orthogonality (cos = 0). Negatives are pushed apart only until they reach orthogonality, then the gradient vanishes — the classes settle on perpendicular axes and the loss flattens. For a large class count this sits right next to the simplex optimum; for very few classes it stops short of it (the simplex would prefer a negative cosine), which is a feature here — no runaway spreading — not a match to the optimum.
Target opposition (cos = $-1$ ). Setting the bias to recover CLIP-style behaviour. Negatives are dragged toward antipodal. For two classes that is the simplex optimum, so 2D shows no penalty; the cost only appears with many classes in high dimensions, where insisting on $-1$ for every pair is geometrically unsatisfiable and the gradient burns batch size fighting it.
Shallow target (cos = $-0.5$ ). A deliberately chosen intermediate stop. The point is not that $-0.5$ is special — it is that the bias lets you pick the stopping cosine, which no other loss in the family exposes.

SigLIP is the first member of the lineage that names the target on the cosine axis as a tunable object rather than inheriting the cosine- $-1$ gradient direction from Hadsell’s margin. Its better small-batch accuracy follows from giving the negatives an explicit stopping point instead of an open-ended push.

The remaining question is whether to hard-code the stop rather than learn it. The cleanest such objective — pull positives to cos $1$ , push negatives to cos $0$ , stop — is the next section.

8. Giving the loss a place to stop: cosine-to-zero

Two claims need separating here, because the previous post and parts of the literature blur them. The first is about equilibrium: where do these losses actually land? The second is about dynamics: what does the gradient do along the way? They have different answers, and only the second is a real complaint.

On equilibrium, the softmax family is mostly fine. InfoNCE, CLIP, and SupCon do not converge to all-pairs-at- $-1$ — that configuration is geometrically impossible for more than two classes, since a sphere holds only one antipodal pair per axis. What they converge to is the regular simplex, every inter-class cosine equal to $-1/(k-1)$ (this is the neural-collapse result). The simplex is the near-optimal arrangement. So the equilibrium is right, or close to it.

The real issue is dynamics. The per-pair gradient of every one of these losses points at $\cos = -1$ , with no built-in notion of “far enough.” Pair contrastive’s margin stops it; the softmax family’s only brake is what the rest of the batch geometrically permits — which is why the spreading never quite stops, why temperature is the de-facto throttle, and (the expensive consequence) why CLIP-scale training needs enormous batches to supply enough negatives to approximate the simplex against an objective that is always pulling further than the geometry allows. The losses end up in roughly the right place; they take a wasteful path to get there, and they carry no signal for when to stop.

That is what an explicit target buys. The simplex optimum, $-1/(k-1)$ , is $-1$ only for two classes and collapses toward $0$ as classes multiply — $-\tfrac12$ at three, $-\tfrac{1}{9}$ at ten, essentially $0$ by ImageNet scale. So a fixed, dimension-independent stop at orthogonality ( $\cos = 0$ ) is a clean approximation to the simplex target: it slightly under-separates for small $k$ (it would leave margin on the table at two or three classes) and is almost exact for large $k$ . SigLIP makes that stop a learnable bias; the next loss hard-codes it at zero.

The cosine scale has three landmarks, not two: alignment at $+1$ , orthogonality at $0$ , opposition at $-1$ . Two senses of “different” pull in different directions here. By distance on the sphere, opposition is maximal — antipodal points are as far apart as points can be. But by independence, opposition is redundant: a vector and its negative span a single line, so each is fully determined by the other ( $\cos^2\theta = 1$ ). Orthogonality is where two directions carry no information about each other ( $\cos^2\theta = 0$ ) — independent, not merely distant. When you are packing many classes into a sphere, independence is the resource you care about, and chasing distance past orthogonality spends capacity on anti-correlation you cannot use. I made this argument in detail in Opposite Is Not Different; the simplest objective consistent with it is to pull positives to cosine $1$ and push negatives to cosine $0$ , not $-1$ :

$$\mathcal{L}_{\cos\to 0}(i, j) \;=\; \begin{cases} 1 - \cos(z_i, z_j) & y_i = y_j \\ \cos(z_i, z_j)^2 & y_i \neq y_j \end{cases}$$

Cosine→0 (the orthogonality objective). Default preset is two classes from random data: they settle on perpendicular axes and the loss flattens — the objective is exactly satisfiable when the class count does not exceed the dimension. The 'four classes' and 'eight classes' presets show the ceiling: more orthogonal directions than 2D can hold.

What this loss does on random data is cleanest with two classes in 2D: it lands them on perpendicular axes, every inter-class cosine hits zero, and the optimisation halts because the objective is genuinely satisfiable — orthogonality is achievable exactly when the number of classes does not exceed the dimension ( $k \le d$ ). With four classes in 2D it cannot be: only two orthogonal directions exist, so the loss is frustrated from the start and settles at a compromise below perfect separability — the same geometric ceiling as the eight-class case, just milder. (This is also a place where orthogonality is less efficient than the simplex it approximates: the simplex packs $k$ classes into $k-1$ dimensions, while exact orthogonality needs a full $k$ . cos→0 trades a little dimension-efficiency for a dimension-independent, overshoot-free target.)
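The $k \le d$ ceiling can be checked with the loss itself. This is a minimal NumPy sketch that places one unit vector per class at hand-picked angles in 2D. The helper names are illustrative, and `frame_floor` is an assumption brought in from frame theory: the standard lower bound $\sum_{i,j} (v_i \cdot v_j)^2 \ge k^2/d$ for $k$ unit vectors in $d$ dimensions, rewritten as a mean over the off-diagonal pairs.

```python
import numpy as np

def cos_to_zero_loss(z, y):
    """Mean over ordered pairs i != j: (1 - cos) for positives, cos^2 for negatives."""
    n = len(z)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    cos = z @ z.T
    same = y[:, None] == y[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    pos_sum = np.sum(1.0 - cos[same & off_diag])
    neg_sum = np.sum(cos[~same] ** 2)
    return float((pos_sum + neg_sum) / off_diag.sum())

def class_directions(angles_deg):
    """One unit vector per class at the given angles (in degrees) on the circle."""
    a = np.deg2rad(np.asarray(angles_deg, dtype=float))
    return np.stack([np.cos(a), np.sin(a)], axis=1), np.arange(len(a))

def frame_floor(k, d):
    """Lowest achievable mean cos^2 over distinct pairs of k unit vectors in d dimensions."""
    return max(k / d - 1.0, 0.0) / (k - 1)

two = cos_to_zero_loss(*class_directions([0, 90]))
four = cos_to_zero_loss(*class_directions([0, 45, 90, 135]))
four_uneven = cos_to_zero_loss(*class_directions([0, 30, 60, 90]))

print(f"k=2: loss {two:.4f}, floor {frame_floor(2, 2):.4f}")
print(f"k=4: loss {four:.4f}, floor {frame_floor(4, 2):.4f}")
print(f"k=4 uneven spacing: loss {four_uneven:.4f}")
```

With two classes the floor is zero and perpendicular axes reach it; with four classes in 2D the floor is strictly positive, and even $\pi/4$ spacing sits on it while uneven spacing does not.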

The failure modes are revealing in a different way than the rest of the family:

Too many classes for the dimension. As above: with $k > d$ classes the objective is unsatisfiable, the loss compromises (eight classes in 2D settle near a uniform $\pi/4$ spacing), and accuracy plateaus below 100%. This is a geometric ceiling, not an optimisation failure — cosine-to-zero doubles as a diagnostic that the embedding dimension is too small for the class count.
Concentric rings. Same problem as InfoNCE. Cosine throws away the norm.

What sets cosine-to-zero apart is not that it finds a better target than the rest — for large $k$ its target and the simplex nearly coincide, and for small $k$ the simplex is actually a touch better. It is that it carries an explicit stop: the negative penalty is exactly minimised at orthogonality, so the loss flattens there instead of pushing forever. SigLIP gets the same stop from a learned bias; the rest of the family has no stop at all, only the geometry of the batch.

The race

You have seen each loss work alone. Now watch them race. Same dataset, same initial points, same step counter — only the loss differs. Eight panels in two rows: the historical lineage on top (Hadsell → FaceNet → InfoNCE → CLIP), and the modern wave on the bottom (SupCon → SigLIP → Align+Uniform → Cosine→0).

Eight contrastive losses, one dataset

All eight losses, same initial points, same step counter. The default dataset is random, so every panel starts from chaos. Hover any point to highlight it in every panel. Try changing the dataset to see how each loss handles structured input.

A handful of patterns become visible only when the panels are running side by side:

Pair and triplet stop early. Once the margin is satisfied the gradient vanishes and the points freeze. This is the margin family’s signature behaviour: a clean, early termination, with whatever geometric capacity remains unused.

Softmax-based losses never stop. InfoNCE, CLIP, SupCon, and Alignment + Uniformity all have non-vanishing gradients. They keep optimising past the point of perfect classification — fine-tuning the angular spread of the negatives long after nearest-centroid accuracy hits 100%. They are heading somewhere sensible (the simplex), but with no internal notion of having arrived; $\tau$ is the de-facto throttle, and at scale the open-ended push is what forces the enormous batch sizes. CLIP and InfoNCE look near-identical in 2D — they are the same loss with a symmetrisation factor.

SupCon is the harshest collapser. Many positives means each point is pulled toward its class centroid every step. Classes become near-Dirac on the circle. Great for classification, terrible if you care about within-class variation.

SigLIP and cosine-to-zero carry a stop. SigLIP’s pairwise sigmoid saturates once each negative is past the bias threshold; cosine-to-zero’s quadratic penalty on negative cosines is exactly minimised at orthogonality. Apart from the margin family, which freezes early, these are the only two whose loss curve flattens on its own rather than spreading until the geometry refuses. For large class counts, the place where they flatten is close to the simplex they’re all chasing.

What this leaves out

The post is restricted to the family of contrastive losses that act on pairwise distances or similarities. Two adjacent ideas are worth mentioning, though they don’t translate to 2D:

Barlow Twins (Zbontar et al., 2021) and VICReg (Bardes et al., 2022) bypass the negative-sample problem by regularising batch statistics: Barlow Twins drives the cross-correlation matrix between the two views toward identity, and VICReg constrains the variance and covariance of each view’s embeddings. They don’t need negatives, but they need enough embedding dimensions for the diagonal and off-diagonal terms to make sense.
DINO and BYOL use a momentum encoder as a slow-moving teacher to generate targets, removing the need for explicit negatives entirely. MoCo shares the momentum encoder but still contrasts against a queue of negatives.

The eight losses racing here all act on the same primitive (pairwise similarities) and disagree about a single geometric question: how far should different-class pairs be? Once you watch the disagreement play out at 60 random points in 2D, with each loss’s named pathologies exposed one preset at a time, the trade-offs the literature has been arguing about for nearly two decades stop being abstract.

The pattern most of them share is not that they land in the wrong place: the softmax family converges to the simplex, which is near-optimal. It is that their gradient points at $\cos = -1$, a direction inherited unexamined from Hadsell’s margin, and once the margin is dropped nothing tells the loss when to stop. The cost is dynamical: a wasteful path, and at scale, batch sizes large enough to brute-force the simplex against an objective that always pulls further. SigLIP was the first to give the negatives an explicit stop by learning a bias; cosine-to-zero hard-codes that stop at orthogonality, a dimension-independent approximation to the $-1/(k-1)$ simplex target that is nearly exact for many classes and deliberately conservative for few. The useful idea was never “opposition is wrong” in some absolute sense; for two classes it is exactly right. It is that a loss should know where to stop, and orthogonality is a good place to put the stop when the classes are many.

Cite as

Bouhsine, T. (2026, May 26). Untangling the Moons: A Visual History of Contrastive Learning. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/untangling-the-moons/

BibTeX

@misc{bouhsine2026untanglingthemoons,
  author       = {Bouhsine, Taha},
  title        = {Untangling the Moons: A Visual History of Contrastive Learning},
  year         = {2026},
  month        = {may},
  howpublished = {\url{https://tahabouhsine.com/blog/untangling-the-moons/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}
