Opposite Is Not Different


#ml #geometry #contrastive #embeddings

A standard assumption in contrastive learning holds that pushing negative pairs to cosine similarity $-1$ achieves “maximal difference.” This is wrong, and the cost has been substantial.

Two unit vectors $\mathbf{u}, \mathbf{v}$ with $\cos\theta(\mathbf{u}, \mathbf{v}) = -1$ are antiparallel ($\mathbf{v} = -\mathbf{u}$), and antiparallel means linearly dependent. They span a single one-dimensional subspace; knowing $\mathbf{u}$ determines $\mathbf{v}$ exactly. In every algebraic, geometric, and information-theoretic sense they are the same direction with a sign flip. Two vectors at $\cos\theta = -1$ are not different. They are redundant.

The correct geometry of difference is orthogonality. Vectors with $\cos\theta = 0$ are linearly independent: their span has dimension two, the projection of one onto the other is zero, and neither can be reconstructed from the other. Orthogonality is where genuinely new information lives.

The rest of this post makes the consequence concrete. CLIP’s InfoNCE loss implicitly targets opposition; SigLIP’s sigmoid loss equilibrates at orthogonality; cross-entropy classification has always targeted orthogonality; the simplex packing result tells us why every multi-class problem above $n = 2$ should be reaching for orthogonality rather than opposition. The cosine scale has three landmarks, not two, and the field spent years engineering around the missing landmark.

The cosine scale has three landmarks

The standard mental model puts cosine similarity on a single axis from $+1$ (“most similar”) to $-1$ (“most different”). That model is missing the structure of vector spaces. Three points on the scale are qualitatively different:

| $\cos\theta$ | Algebraic status | Information content |
| --- | --- | --- |
| $+1$ | Parallel (same direction) | Maximally redundant |
| $-1$ | Antiparallel (opposite direction) | Maximally redundant (sign-flipped) |
| $\phantom{-}0$ | Orthogonal (perpendicular) | Zero shared information |

The “difference” axis runs from $\pm 1$ (dependent) to $0$ (independent), not from $+1$ to $-1$.

[Interactive figure: drag the green tip around the unit circle. When $\mathbf{v}$ lines up with $\pm\mathbf{u}$ ($\cos\theta = \pm 1$), the span of $(\mathbf{u}, \mathbf{v})$ collapses to a single line and the two vectors are linearly dependent. Anywhere else, the span fills the plane and its dimension is 2. The two cases that look “most different” on the cosine axis, the parallel and antiparallel snap points, are exactly the cases where the span is degenerate. Orthogonality is the configuration that actually carries new information.]

The information argument is the cleanest way to see it. Define directional mutual information as $I_\text{dir}(\mathbf{u}; \mathbf{v}) = \cos^2\theta$, the fraction of $\mathbf{v}$'s variance explained by $\mathbf{u}$. Both parallel and antiparallel give $\cos^2\theta = 1$: each one is reconstructable from the other up to a sign. Only orthogonality gives $\cos^2\theta = 0$: neither carries any information about the other.
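
A quick numerical check of both claims, the span collapse and the $\cos^2\theta$ redundancy measure (the helper name below is illustrative, not from any library):

```python
import numpy as np

def directional_info(u, v):
    """Shared directional information I_dir(u; v) = cos^2(theta) for unit vectors."""
    return float(np.dot(u, v)) ** 2

u = np.array([1.0, 0.0])
cases = {
    "parallel":     np.array([ 1.0, 0.0]),
    "antiparallel": np.array([-1.0, 0.0]),
    "orthogonal":   np.array([ 0.0, 1.0]),
}
for name, v in cases.items():
    cos = float(np.dot(u, v))
    dim = np.linalg.matrix_rank(np.stack([u, v]))  # dim span(u, v)
    print(f"{name:12s}  cos={cos:+.0f}  cos^2={directional_info(u, v):.0f}  dim span={dim}")
# parallel      cos=+1  cos^2=1  dim span=1
# antiparallel  cos=-1  cos^2=1  dim span=1
# orthogonal    cos=+0  cos^2=0  dim span=2
```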

[Interactive figure: three quantities on the cosine axis. The similarity $\cos\theta$ (solid) bottoms out at 180°. Shared information $\cos^2\theta$ (dashed) is the actual redundancy measure; it is maximized at both 0° and 180°, so “opposite” and “identical” are equally redundant. The independence curve $1 - \cos^2\theta$ (dotted) peaks at exactly 90°. Drag along the chart to read the three values at any angle.]

The thesis in one line:

$$\boxed{\;\text{max difference} \;\Longleftrightarrow\; \cos^2 \theta = 0 \;\Longleftrightarrow\; \mathbf{u} \perp \mathbf{v}\;}$$

CLIP optimizes for the wrong target

OpenAI’s CLIP is trained with InfoNCE, a softmax contrastive loss over a batch of $N$ image-text pairs:

$$\mathcal{L}_\text{CLIP} = -\frac{1}{N}\sum_i \log \frac{\exp(\mathrm{sim}_{ii}/\tau)}{\sum_k \exp(\mathrm{sim}_{ik}/\tau)}.$$

The gradient with respect to any negative similarity is strictly positive: the loss decreases monotonically as that similarity decreases. Cosine similarity is bounded below by $-1$, so the global minimum of every negative term is antiparallel alignment. CLIP wants every pair of unlike concepts to be linearly dependent on each other.
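
A sketch of that gradient argument in NumPy, assuming the cosine similarities are already computed. The closed-form gradient of the image-to-text InfoNCE term with respect to the similarity matrix is $(\mathrm{softmax}(\mathrm{sim}/\tau) - I)/(N\tau)$ row-wise, so every off-diagonal (negative-pair) entry is strictly positive:

```python
import numpy as np

def infonce_grad(sim, tau=0.07):
    """Gradient of the row-wise InfoNCE loss w.r.t. the similarity matrix:
    (softmax(sim / tau) - I) / (N * tau)."""
    n = sim.shape[0]
    z = sim / tau
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return (p - np.eye(n)) / (n * tau)

rng = np.random.default_rng(0)
sim = rng.normal(0.0, 0.05, size=(8, 8))  # negatives concentrated near cos = 0
np.fill_diagonal(sim, 0.9)                # matched pairs are highly similar

g = infonce_grad(sim)
negatives = g[~np.eye(8, dtype=bool)]
print("every negative-pair gradient is positive:", bool((negatives > 0).all()))
# True: the loss keeps rewarding a lower similarity all the way down to -1.
```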

This is geometrically impossible at scale. In $\mathbb{R}^d$, the maximum number of independent antiparallel pairs is exactly $d$, one binary opposition per orthogonal axis. CLIP’s 512-dimensional embedding space carries at most 512 such oppositions. ImageNet has 1,000 classes; the real world has millions of concepts. A loss demanding $\cos\theta = -1$ for every negative is asking for a configuration that does not exist in the space it’s being computed in.

The computational cost follows from the same mistake. Random unit vectors in $\mathbb{S}^{d-1}$ concentrate near cosine zero with variance $1/d$; for $d = 512$, the standard deviation of a random pair’s cosine is around $0.044$. The loss’s gradient is dominated by the rare negatives that happen to lie far enough from the equator to register; most of the batch contributes near-zero signal. To accumulate enough gradient, CLIP was trained with $N = 32{,}768$ pairs per step, an $N \times N$ similarity matrix at ${\sim}4$ GB per device, and a multi-year engineering effort in distributed training, gradient caching, and memory-efficient attention. The engineering was needed because the geometry was wrong.
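
The concentration claim is a one-screen experiment; this sketch just samples random unit vectors and measures the spread of their pairwise cosines:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 512, 100_000

u = rng.standard_normal((n_pairs, d))
v = rng.standard_normal((n_pairs, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)

cos = np.sum(u * v, axis=1)
print(f"mean={cos.mean():+.4f}  std={cos.std():.4f}  1/sqrt(d)={1/np.sqrt(d):.4f}")
# mean ~ 0.0000, std ~ 0.0442: almost the entire batch of negatives sits at the equator.
```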

SigLIP gets it right

Google’s SigLIP replaces the softmax with a pairwise sigmoid:

$$\mathcal{L}_\text{SigLIP} = \sum_{i, j} \log\!\big(1 + \exp\big(-y_{ij}(\mathrm{sim}_{ij}/\tau - b)\big)\big),$$

with $y_{ij} = \pm 1$ for matched/mismatched pairs and $b$ a learnable bias. The sigmoid gradient on a mismatched pair vanishes once $\mathrm{sim}_{ij}/\tau$ sits a few units below $b$. The loss does not push negatives toward $-1$; it requires only that they fall below the bias.
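
A sketch of the per-pair gradient, to make “vanishes below the bias” concrete. The values of $\tau$ and $b$ here are placeholders for illustration, not SigLIP’s learned parameters:

```python
import numpy as np

def siglip_pair_grad(sim, y, tau=0.05, b=0.0):
    """d/d(sim) of log(1 + exp(-y * (sim/tau - b))) for one pair:
    -y * sigmoid(-y * (sim/tau - b)) / tau."""
    m = -y * (sim / tau - b)
    return -y / tau * (1.0 / (1.0 + np.exp(-m)))

# Mismatched pair (y = -1): gradient magnitude as the similarity drops below the bias.
for s in [0.3, 0.1, 0.0, -0.1, -0.3, -0.5]:
    print(f"sim={s:+.1f}  |grad|={abs(siglip_pair_grad(s, y=-1)):.2e}")
# The magnitude decays exponentially once sim/tau sits a few units below b;
# there is no remaining pressure to drag the pair toward cos = -1.
```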

That single change aligns the objective with the geometry. Random embeddings on $\mathbb{S}^{d-1}$ already concentrate around cosine zero, exactly where SigLIP is willing to leave them. The loss isn’t fighting the spherical geometry to drag every negative across the equator. The pairwise structure eliminates the $N \times N$ softmax competition. The result is better zero-shot accuracy with smaller batches and less compute. The headline framing is “sigmoid beats softmax,” but the structural framing is sharper: targeting orthogonality lets the network use the geometry it is operating on instead of fighting against it.

Cross-entropy has always targeted orthogonality

The cross-entropy loss has been doing the right thing the whole time. For discrete distributions $p, q$,

$$H(p, q) = -\sum_x p(x) \log q(x),$$

and the key fact is its singularity structure: if $\mathrm{supp}(p) \cap \mathrm{supp}(q) = \varnothing$, then $H(p, q) = +\infty$. Disjoint supports are the probabilistic analog of orthogonal vectors: distributions that share no mass, like vectors that share no projection. The cross-entropy singularity is the orthogonality condition lifted into probability.
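
A small numerical illustration of the singularity (the helper is just for this post, not a library function):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); infinite when supp(p) escapes supp(q)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(-np.sum(p * np.log(q), where=p > 0))

p = np.array([0.5, 0.5, 0.0, 0.0])
print(cross_entropy(p, np.array([0.4, 0.6, 0.0, 0.0])))  # finite: supports overlap
print(cross_entropy(p, np.array([0.0, 0.0, 0.5, 0.5])))  # inf: disjoint supports
```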

The same structure shows up directly in classifier weights. For a softmax classifier with logits $z_k = \mathbf{w}_k^\top \mathbf{h}$ and class label $y$, the gradient on wrong-class logits $z_k$ pushes them toward $-\infty$. On the unit sphere the loss has competing pressures: $\mathbf{w}_y$ wants to be parallel to $\mathbf{h}$ to maximize $z_y$, and each $\mathbf{w}_k$ with $k \neq y$ wants to be antiparallel to $\mathbf{h}$ to minimize $z_k$. With $n$ classes sharing the same feature space, the antiparallel target cannot be reached by all $n-1$ wrong-class weights at once: each $\mathbf{w}_k$ must also serve as the correct-class direction for its own class, so the weights cannot all collapse onto $-\mathbf{h}$. The equilibrium is whatever configuration best balances those pressures subject to mutual diversity of the $\mathbf{w}_k$, and the next section shows that configuration is the regular simplex with $\langle\mathbf{w}_k, \mathbf{h}\rangle = -1/(n-1)$: approximately orthogonal for any non-trivial multi-class problem, exactly orthogonal in the $n \to \infty$ limit, and exactly antiparallel only in the binary case $n = 2$. The “well-separated representations” that plain cross-entropy classifiers produce without any contrastive auxiliary loss are not a happy accident. They are the simplex, sitting where the geometry says it should be.
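
The gradient of cross-entropy with respect to the logits makes the competing pressures explicit: it is $\mathrm{softmax}(z) - \mathrm{onehot}(y)$, negative on the true-class logit and positive on every wrong-class logit. A minimal sketch:

```python
import numpy as np

def ce_logit_grad(z, y):
    """Gradient of -log softmax(z)[y] w.r.t. the logits: softmax(z) - onehot(y)."""
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return p

z = np.array([2.0, 0.3, -0.1, 0.5])  # logits z_k = w_k . h for four classes
print(ce_logit_grad(z, y=0))
# Entry 0 is negative (z_y is pushed up); every other entry is positive,
# so each wrong-class weight is pulled toward -h until the simplex balance is reached.
```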

The simplex packing result

What is the optimal arrangement of $n$ class representations on $\mathbb{S}^{d-1}$ for $n \le d + 1$? It is the regular simplex: $n$ unit vectors with all pairwise cosines equal to

$$\cos\theta_{ij} = -\frac{1}{n - 1}.$$

For $n = 2$ this is $-1$: exactly antiparallel. The binary case is where the opposition-is-difference intuition came from, and where it is correct. The confusion starts when that intuition is generalized.

[Interactive figure: the optimal inter-class cosine $\cos\theta = -1/(n-1)$ for $n$ classes on the simplex. The curve crosses from opposition territory ($n = 2$) into orthogonality territory (the asymptote at zero) almost immediately: by $n = 3$ the optimal cosine is already only $-0.5$; by $n = 10$ it is $-0.11$; by $n = 50$, $-0.02$. Multi-class is orthogonality up to vanishing corrections. Drag the $n$ slider to see where any specific problem sits.]

For $n \ge 3$ the optimal configuration moves rapidly toward orthogonality: $n = 3$ gives $-\tfrac{1}{2}$; $n = 10$ gives $-\tfrac{1}{9} \approx -0.11$; $n = 50$ gives about $-0.02$. As $n \to \infty$, $\cos\theta \to 0$. Beyond the smallest cases the simplex is orthogonal up to a vanishing correction, and any loss that insists on $\cos\theta = -1$ for every negative is asking the geometry for something it cannot supply.
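
The simplex is also easy to construct and check directly; a minimal sketch, assuming nothing beyond NumPy: center the $n$ standard basis vectors and renormalize, and every pairwise cosine comes out at exactly $-1/(n-1)$:

```python
import numpy as np

def regular_simplex(n):
    """n unit vectors (rows) with all pairwise cosines equal to -1/(n-1)."""
    v = np.eye(n) - 1.0 / n  # center the standard basis at the origin
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for n in [2, 3, 10, 50]:
    s = regular_simplex(n)
    off_diag = (s @ s.T)[~np.eye(n, dtype=bool)]
    print(f"n={n:3d}  pairwise cos={off_diag.mean():+.4f}  -1/(n-1)={-1 / (n - 1):+.4f}")
# n=  2  pairwise cos=-1.0000  -1/(n-1)=-1.0000
# n=  3  pairwise cos=-0.5000  -1/(n-1)=-0.5000
# n= 10  pairwise cos=-0.1111  ...
```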

What changes if you get the target right

Every method that targets orthogonality (SigLIP, plain cross-entropy) shares two properties: it aligns with the natural concentration of measure on $\mathbb{S}^{d-1}$, and it achieves comparable or better accuracy with substantially less compute. Every method that targets opposition (SimCLR, CLIP, SupCon) requires enormous batches to overcome the geometric tension between its objective and the high-dimensional sphere it is operating on.

The thesis, in one sentence: the geometry of difference is not opposition; it is orthogonality. The most influential contrastive losses of the past five years spent enormous engineering effort compensating for a single geometric mistake — and the loss that did not make the mistake was sitting beside them the whole time, in the form of plain cross-entropy.

Cite as

Bouhsine, T. (2026). Opposite Is Not Different. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/opposite-is-not-different/

BibTeX
@misc{bouhsine2026oppositeisnotdifferent,
  author       = {Bouhsine, Taha},
  title        = {Opposite Is Not Different},
  year         = {2026},
  month        = {feb},
  howpublished = {\url{https://tahabouhsine.com/blog/opposite-is-not-different/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper

Bouhsine, T. (2026). Opposite ≠ Different: The Orthogonality Thesis. Unpublished manuscript. [PDF]

BibTeX
@unpublished{bouhsine2026opposite,
  author = {Bouhsine, T.},
  title  = {Opposite ≠ Different: The Orthogonality Thesis},
  year   = {2026},
  note   = {Unpublished manuscript}
}