Opposite Is Not Different
A standard assumption in contrastive learning holds that pushing negative pairs to cosine similarity $-1$ achieves “maximal difference.” This is wrong, and the cost has been substantial.
Two unit vectors with $\cos\theta = -1$ are antiparallel — $v = -u$ — and antiparallel means linearly dependent. They span a single one-dimensional subspace; knowing $u$ determines $v$ exactly. In every algebraic, geometric, and information-theoretic sense they are the same direction with a sign flip. Two vectors at $\cos\theta = -1$ are not different. They are redundant.
The correct geometry of difference is orthogonality. Vectors with $\cos\theta = 0$ are linearly independent: their span has dimension two, the projection of one onto the other is zero, and neither can be reconstructed from the other. Orthogonality is where genuinely new information lives.
The rest of this post makes the consequence concrete. CLIP’s InfoNCE loss implicitly targets opposition; SigLIP’s sigmoid loss equilibrates at orthogonality; cross-entropy classification has always targeted orthogonality; the simplex packing result tells us why every multi-class problem above $K = 2$ should be reaching for orthogonality rather than opposition. The cosine scale has three landmarks, not two — and the field spent years engineering around the missing landmark.
The cosine scale has three landmarks
The standard mental model puts cosine similarity on a single axis from $+1$ (“most similar”) to $-1$ (“most different”). That model is missing the structure of vector spaces. Three points on the scale are qualitatively different:
| $\cos\theta$ | Algebraic status | Information content |
|---|---|---|
| $+1$ | Parallel (same direction) | Maximally redundant |
| $-1$ | Antiparallel (opposite direction) | Maximally redundant (sign-flipped) |
| $0$ | Orthogonal (perpendicular) | Zero shared information |
The “difference” axis runs from $|\cos\theta| = 1$ (dependent) to $\cos\theta = 0$ (independent), not from $+1$ to $-1$.
The information argument is the cleanest way to see it. Define directional mutual information as $I(u; v) = \cos^2\theta$ — the fraction of $v$’s variance explained by $u$. Both parallel and antiparallel give $I = 1$: each one is reconstructable from the other up to a sign. Only orthogonality gives $I = 0$: neither carries any information about the other.
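The three landmarks are easy to verify numerically. A minimal NumPy sketch (the function name `directional_mi` is mine, implementing the $\cos^2\theta$ definition above):

```python
import numpy as np

def directional_mi(u, v):
    """I(u; v) = cos^2(theta): the fraction of v's variance explained by u."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return cos ** 2

u = np.array([1.0, 0.0])
print(directional_mi(u, u))                     # parallel     -> 1.0
print(directional_mi(u, -u))                    # antiparallel -> 1.0
print(directional_mi(u, np.array([0.0, 1.0])))  # orthogonal   -> 0.0
```

Parallel and antiparallel are indistinguishable under this measure; only the orthogonal pair carries zero shared information.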
The thesis in one line:

> The geometry of difference is not opposition; it is orthogonality.
CLIP optimizes for the wrong target
OpenAI’s CLIP is trained with InfoNCE — a softmax contrastive loss over a batch of $N$ image–text pairs with similarity matrix $s_{ij}$ and temperature $\tau$ (shown in the image-to-text direction; CLIP symmetrizes over both):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}$$
The gradient with respect to any negative similarity $s_{ij}$, $j \neq i$, is strictly positive: the loss decreases monotonically as that similarity decreases. Cosine similarity is bounded below by $-1$, so the global minimum of every negative term is antiparallel alignment. CLIP wants every pair of unlike concepts to be linearly dependent on each other.
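The sign claim follows from the standard softmax gradient, $\partial\mathcal{L}/\partial s_{ij} = (P_{ij} - \delta_{ij})/(N\tau)$, and can be checked in closed form on a toy similarity matrix (the values below are arbitrary; only the signs matter):

```python
import numpy as np

rng = np.random.default_rng(0)
N, tau = 4, 0.07
S = rng.uniform(-1, 1, size=(N, N))  # toy cosine-similarity matrix

# InfoNCE (one direction): L = mean_i [logsumexp_j(s_ij/tau) - s_ii/tau].
logits = S / tau
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)   # row-wise softmax
grad = (P - np.eye(N)) / (N * tau)  # dL/ds_ij in closed form

off_diag = grad[~np.eye(N, dtype=bool)]
print((off_diag > 0).all())       # True: every negative similarity is pushed down
print((np.diag(grad) < 0).all())  # True: every positive similarity is pushed up
```

Softmax probabilities are strictly positive, so every off-diagonal gradient is strictly positive regardless of the batch: the loss never stops pushing a negative pair toward $-1$.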
This is geometrically impossible at scale. In $\mathbb{R}^d$, the maximum number of mutually antiparallel unit-vector pairs is exactly $d$ — one pair per coordinate axis. CLIP’s 512-dimensional embedding space carries at most 512 binary oppositions. ImageNet has 1,000 classes; the real world has millions of concepts. A loss demanding $s_{ij} = -1$ for every negative is asking for a configuration that does not exist in the space it’s being computed in.
The computational cost follows from the same mistake. Random unit vectors in $\mathbb{R}^d$ concentrate near cosine zero with variance $1/d$ — for $d = 512$, the standard deviation of a random pair’s cosine is around $0.044$. The loss’s gradient is dominated by the rare negatives that happen to lie far enough from the equator to register; most of the batch contributes near-zero signal. To accumulate enough gradient, CLIP was trained with $32{,}768$ pairs per step, a $32{,}768 \times 32{,}768$ similarity matrix (roughly 4 GB in fp32), and a multi-year engineering effort in distributed training, gradient caching, and memory-efficient attention. The engineering was needed because the geometry was wrong.
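The concentration claim is a one-liner to simulate — sample random unit-vector pairs in $\mathbb{R}^{512}$ and look at the spread of their cosines:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 10_000

# n random unit-vector pairs in R^d (normalised Gaussians are uniform on the sphere).
u = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)

cos = np.sum(u * v, axis=1)
print(cos.mean())  # ~0: pairs concentrate at the equator
print(cos.std())   # ~1/sqrt(512) ≈ 0.044
```

Nearly the whole batch sits within a few hundredths of cosine zero — exactly the region where an opposition-seeking loss sees negligible signal.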
SigLIP gets it right
Google’s SigLIP replaces the softmax with a pairwise sigmoid:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\log\sigma\big(z_{ij}\,(t\,s_{ij} + b)\big)$$
with $z_{ij} = +1$ for matched and $-1$ for mismatched pairs, $t$ a learnable temperature, and $b$ a learnable bias. The sigmoid gradient on mismatched pairs vanishes once $s_{ij}$ falls below $-b/t$ by a few multiples of $1/t$. The loss does not push negatives toward $-1$; it requires only that they fall below the bias threshold.
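The vanishing gradient is visible directly in the derivative. For a mismatched pair ($z_{ij} = -1$) the per-pair gradient magnitude is $t\,\sigma(t\,s + b)$. A sketch with $t = 10$, $b = -10$ — illustrative values matching SigLIP’s reported initialisation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_pair_grad(s, t=10.0, b=-10.0):
    """|d/ds| of -log(sigmoid(-(t*s + b))), the mismatched-pair loss term."""
    return t * sigmoid(t * s + b)

print(neg_pair_grad(0.5))   # logit -5:  ~0.067, already small
print(neg_pair_grad(0.0))   # logit -10: ~4.5e-4, effectively zero
print(neg_pair_grad(-1.0))  # logit -20: ~2e-8, numerically zero
```

A negative pair sitting at cosine zero — where random pairs already concentrate — contributes essentially no gradient, so the loss leaves it alone rather than dragging it toward $-1$.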
That single change aligns the objective with the geometry. Random embeddings on $S^{d-1}$ already concentrate around cosine zero — exactly where SigLIP is willing to leave them. The loss isn’t fighting the spherical geometry to drag every negative across the equator. The pairwise structure eliminates the softmax competition. The result is better zero-shot accuracy with smaller batches and less compute. The headline framing is “sigmoid beats softmax,” but the structural framing is sharper: targeting orthogonality lets the network use the geometry it is operating on instead of fighting against it.
Cross-entropy has always targeted orthogonality
The cross-entropy loss has been doing the right thing the whole time. For discrete distributions $p$ and $q$,

$$H(p, q) = -\sum_x p(x)\log q(x),$$
and the key fact is its singularity structure: if $p(x) > 0$ anywhere that $q(x) = 0$, then $H(p, q) = \infty$. Disjoint supports are the probabilistic analog of orthogonal vectors — distributions that share no mass, like vectors that share no projection. The cross-entropy singularity is the orthogonality condition lifted into probability.
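The singularity is easy to exhibit with a toy pair of distributions (the helper `cross_entropy` is mine; it uses the standard convention $0 \log 0 = 0$):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), with the convention 0*log(0) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, -p * np.log(q), 0.0)
    return terms.sum()

p = [0.5, 0.5, 0.0]
print(cross_entropy(p, [0.5, 0.5, 0.0]))  # finite: ln 2
print(cross_entropy(p, [0.0, 0.0, 1.0]))  # disjoint supports -> inf
```

Overlapping supports give a finite penalty; the moment $q$ assigns zero mass where $p$ lives, the loss diverges — the probabilistic version of “no shared projection.”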
The same structure shows up directly in classifier weights. For a softmax classifier with logits $z_k = w_k^\top h$ and class label $y$, the gradient on wrong-class logits pushes them toward $-\infty$. On the unit sphere the loss has competing pressures: $w_y$ wants to be parallel to $h$ to maximise $z_y$, and each $w_k$ with $k \neq y$ wants to be antiparallel to $h$ to minimise $z_k$. With $K$ classes sharing the same $h$, the antiparallel target cannot be reached by all wrong-class weights simultaneously — they cannot all be $-h$. The equilibrium is whatever configuration best balances those pressures subject to mutual diversity of the $w_k$, and the next section shows that configuration is the regular simplex with pairwise cosine $-1/(K-1)$: approximately orthogonal for any non-trivial multi-class problem, exactly orthogonal in the limit $K \to \infty$, and exactly antiparallel only in the binary case $K = 2$. The “well-separated representations” that plain cross-entropy classifiers produce without any contrastive auxiliary loss are not a happy accident. They are the simplex, sitting where the geometry says it should.
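The competing pressures fall out of the closed-form gradient $\partial\mathcal{L}/\partial w_k = (p_k - \mathbb{1}[k = y])\,h$: every wrong-class gradient points exactly along $+h$, so the descent step moves $w_k$ along $-h$. A sketch with random toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 8
W = rng.normal(size=(K, d))  # class weight vectors
h = rng.normal(size=d)       # shared feature vector
y = 0                        # true class

# Softmax cross-entropy gradient w.r.t. each weight vector:
# dL/dW_k = (p_k - [k == y]) * h
p = np.exp(W @ h - (W @ h).max())
p /= p.sum()
grad = (p - np.eye(K)[y])[:, None] * h[None, :]

# Every wrong-class gradient is exactly parallel to h, so the update
# (minus the gradient) pulls each wrong-class weight toward -h.
for k in range(1, K):
    cos = grad[k] @ h / (np.linalg.norm(grad[k]) * np.linalg.norm(h))
    print(round(cos, 6))  # 1.0 for every wrong class
```

All $K - 1$ wrong-class weights are being pulled toward the single direction $-h$ at once — the pressure that the simplex equilibrium has to resolve.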
The simplex packing result
What is the optimal arrangement of $K$ class representations on $S^{d-1}$ for $K \le d + 1$? It is the regular simplex: $K$ unit vectors with all pairwise cosines equal to

$$\cos\theta_{ij} = -\frac{1}{K-1}.$$
For $K = 2$ this is $-1$ — exactly antiparallel. The binary case is where the opposition-is-difference intuition came from, and where it is correct. The confusion starts when that intuition is generalized.
For $K > 2$ the optimal configuration moves rapidly toward orthogonality. $K = 3$: $-1/2$. $K = 10$: $-1/9 \approx -0.11$. $K = 1000$: $-1/999 \approx -0.001$. As $K \to \infty$, $-1/(K-1) \to 0$. Beyond the smallest cases the simplex is orthogonal up to a vanishing correction, and any loss that insists on $-1$ for every negative is asking the geometry for something it cannot supply.
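One standard construction makes the $-1/(K-1)$ cosines concrete: centre the $K$ one-hot vectors and renormalise (a sketch of one construction, not the only one):

```python
import numpy as np

def regular_simplex(K):
    """K unit vectors in R^K whose pairwise cosines all equal -1/(K-1):
    the centred, renormalised one-hot vectors."""
    V = np.eye(K) - 1.0 / K
    return V / np.linalg.norm(V, axis=1, keepdims=True)

for K in (2, 3, 10, 1000):
    S = regular_simplex(K)
    cos = S @ S.T
    off_diag = cos[~np.eye(K, dtype=bool)]
    print(K, off_diag.mean())  # -1, -0.5, -1/9, -1/999: -> 0 as K grows
```

At $K = 2$ the construction recovers the antiparallel pair; by $K = 1000$ the vertices are orthogonal to three decimal places.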
What changes if you get the target right
Every method that targets orthogonality (SigLIP, plain cross-entropy) shares two properties: it aligns with the natural concentration of measure on , and it achieves comparable or better accuracy with substantially less compute. Every method that targets opposition (SimCLR, CLIP, SupCon) requires enormous batches to overcome the geometric tension between its objective and the high-dimensional sphere it is operating on.
The thesis, in one sentence: the geometry of difference is not opposition; it is orthogonality. The most influential contrastive losses of the past five years spent enormous engineering effort compensating for a single geometric mistake — and the loss that did not make the mistake was sitting beside them the whole time, in the form of plain cross-entropy.
Cite as
Bouhsine, T. (2026, February). Opposite Is Not Different. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/opposite-is-not-different/
BibTeX
@misc{bouhsine2026oppositeisnotdifferent,
author = {Bouhsine, Taha},
title = {Opposite Is Not Different},
year = {2026},
month = {feb},
howpublished = {\url{https://tahabouhsine.com/blog/opposite-is-not-different/}},
note = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper
Bouhsine, T. (2026). Opposite ≠ Different: The Orthogonality Thesis. Unpublished manuscript. [PDF]
BibTeX
@unpublished{bouhsine2026opposite,
author = {Bouhsine, Taha},
title = {Opposite ≠ Different: The Orthogonality Thesis},
year = {2026},
note = {Unpublished manuscript}
}