Opposite Is Not Different


#ml #geometry #contrastive #embeddings

A standard assumption in contrastive learning holds that pushing negative pairs to cosine similarity $-1$ achieves “maximal difference.” This is wrong, and the cost has been substantial.

Two unit vectors $\mathbf{u}, \mathbf{v}$ with $\cos\theta(\mathbf{u}, \mathbf{v}) = -1$ are antiparallel ($\mathbf{v} = -\mathbf{u}$), and antiparallel means linearly dependent. They span a single one-dimensional subspace; knowing $\mathbf{u}$ determines $\mathbf{v}$ exactly. In every algebraic, geometric, and information-theoretic sense they are the same direction with a sign flip. Two vectors at $\cos\theta = -1$ are not different. They are redundant.

The correct geometry of difference is orthogonality. Vectors with $\cos\theta = 0$ are linearly independent: their span has dimension two, the projection of one onto the other is zero, and neither can be reconstructed from the other. Orthogonality is where genuinely new information lives.

The rest of this post makes the consequence concrete. CLIP’s InfoNCE loss implicitly targets opposition; SigLIP’s sigmoid loss equilibrates at orthogonality; cross-entropy classification has always targeted orthogonality; the simplex packing result tells us why every multi-class problem above $n = 2$ should be reaching for orthogonality rather than opposition. The cosine scale has three landmarks, not two, and the field spent years engineering around the missing landmark.

The cosine scale has three landmarks

The standard mental model puts cosine similarity on a single axis from $+1$ (“most similar”) to $-1$ (“most different”). That model is missing the structure of vector spaces. Three points on the scale are qualitatively different:

| $\cos\theta$ | Algebraic status | Information content |
| --- | --- | --- |
| $+1$ | Parallel (same direction) | Maximally redundant |
| $-1$ | Antiparallel (opposite direction) | Maximally redundant (sign-flipped) |
| $\phantom{-}0$ | Orthogonal (perpendicular) | Zero shared information |

The “difference” axis runs from $\pm 1$ (dependent) to $0$ (independent), not from $+1$ to $-1$.

[Interactive figure: drag the green tip around the unit circle. When $\mathbf{v}$ lines up with $\pm\mathbf{u}$ ($\cos\theta = \pm 1$), the span of $(\mathbf{u}, \mathbf{v})$ collapses to a single line and the two vectors are linearly dependent. Anywhere else, the span fills the plane and its dimension is 2. The two cases that look “most different” on the cosine axis, the parallel and antiparallel snap points, are exactly the cases where the span is degenerate. Orthogonality is the configuration that actually carries new information.]

The information argument is the cleanest way to see it. Define directional mutual information as $I_\text{dir}(\mathbf{u}; \mathbf{v}) = \cos^2\theta$, the fraction of $\mathbf{v}$'s variance explained by $\mathbf{u}$. Both parallel and antiparallel give $\cos^2\theta = 1$: each one is reconstructable from the other up to a sign. Only orthogonality gives $\cos^2\theta = 0$: neither carries any information about the other.
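
A quick numerical check of both claims, the span collapse and the $\cos^2\theta$ redundancy measure (the helper name below is illustrative, not from any library):

```python
import numpy as np

def directional_info(u, v):
    """Shared directional information I_dir(u; v) = cos^2(theta) for unit vectors."""
    return float(np.dot(u, v)) ** 2

u = np.array([1.0, 0.0])
cases = {
    "parallel":     np.array([ 1.0, 0.0]),
    "antiparallel": np.array([-1.0, 0.0]),
    "orthogonal":   np.array([ 0.0, 1.0]),
}
for name, v in cases.items():
    cos = float(np.dot(u, v))
    dim = np.linalg.matrix_rank(np.stack([u, v]))  # dim span(u, v)
    print(f"{name:12s}  cos={cos:+.0f}  cos^2={directional_info(u, v):.0f}  dim span={dim}")
# parallel      cos=+1  cos^2=1  dim span=1
# antiparallel  cos=-1  cos^2=1  dim span=1
# orthogonal    cos=+0  cos^2=0  dim span=2
```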

[Interactive figure: three quantities on the cosine axis. The similarity $\cos\theta$ (solid) bottoms out at 180°. Shared information $\cos^2\theta$ (dashed) is the actual redundancy measure; it is maximized at both 0° and 180°, so “opposite” and “identical” are equally redundant. The independence curve $1 - \cos^2\theta$ (dotted) peaks at exactly 90°. Drag along the chart to read the three values at any angle.]

The thesis in one line:

$$\boxed{\;\text{max difference} \;\Longleftrightarrow\; \cos^2 \theta = 0 \;\Longleftrightarrow\; \mathbf{u} \perp \mathbf{v}\;}$$

CLIP optimizes for the wrong target

OpenAI’s CLIP is trained with InfoNCE, a softmax contrastive loss over a batch of $N$ image-text pairs:

$$\mathcal{L}_\text{CLIP} = -\frac{1}{N}\sum_i \log \frac{\exp(\mathrm{sim}_{ii}/\tau)}{\sum_k \exp(\mathrm{sim}_{ik}/\tau)}.$$

The gradient with respect to any negative similarity is strictly positive: the loss decreases monotonically as that similarity decreases. Cosine similarity is bounded below by $-1$, so the global minimum of every negative term is antiparallel alignment. CLIP wants every pair of unlike concepts to be linearly dependent on each other.
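
A sketch of that gradient argument in NumPy, assuming the cosine similarities are already computed. The closed-form gradient of the image-to-text InfoNCE term with respect to the similarity matrix is $(\mathrm{softmax}(\mathrm{sim}/\tau) - I)/(N\tau)$ row-wise, so every off-diagonal (negative-pair) entry is strictly positive:

```python
import numpy as np

def infonce_grad(sim, tau=0.07):
    """Gradient of the row-wise InfoNCE loss w.r.t. the similarity matrix:
    (softmax(sim / tau) - I) / (N * tau)."""
    n = sim.shape[0]
    z = sim / tau
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return (p - np.eye(n)) / (n * tau)

rng = np.random.default_rng(0)
sim = rng.normal(0.0, 0.05, size=(8, 8))  # negatives concentrated near cos = 0
np.fill_diagonal(sim, 0.9)                # matched pairs are highly similar

g = infonce_grad(sim)
negatives = g[~np.eye(8, dtype=bool)]
print("every negative-pair gradient is positive:", bool((negatives > 0).all()))
# True: the loss keeps rewarding a lower similarity all the way down to -1.
```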

This is geometrically impossible at scale. In $\mathbb{R}^d$, the maximum number of independent antiparallel pairs is exactly $d$, one binary opposition per orthogonal axis. CLIP’s 512-dimensional embedding space carries at most 512 such oppositions. ImageNet has 1,000 classes; the real world has millions of concepts. A loss demanding $\cos\theta = -1$ for every negative is asking for a configuration that does not exist in the space it’s being computed in.

The computational cost follows from the same mistake. Random unit vectors in $\mathbb{S}^{d-1}$ concentrate near cosine zero with variance $1/d$; for $d = 512$, the standard deviation of a random pair’s cosine is around $0.044$. The loss’s gradient is dominated by the rare negatives that happen to lie far enough from the equator to register; most of the batch contributes near-zero signal. To accumulate enough gradient, CLIP was trained with $N = 32{,}768$ pairs per step, an $N \times N$ similarity matrix at ${\sim}4$ GB per device, and a multi-year engineering effort in distributed training, gradient caching, and memory-efficient attention. The engineering was needed because the geometry was wrong.
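
The concentration claim is a one-screen experiment; this sketch just samples random unit vectors and measures the spread of their pairwise cosines:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 512, 100_000

u = rng.standard_normal((n_pairs, d))
v = rng.standard_normal((n_pairs, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)

cos = np.sum(u * v, axis=1)
print(f"mean={cos.mean():+.4f}  std={cos.std():.4f}  1/sqrt(d)={1/np.sqrt(d):.4f}")
# mean ~ 0.0000, std ~ 0.0442: almost the entire batch of negatives sits at the equator.
```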

SigLIP gets it right

Google’s SigLIP replaces the softmax with a pairwise sigmoid:

$$\mathcal{L}_\text{SigLIP} = \sum_{i, j} \log\!\big(1 + \exp\big(-y_{ij}(\mathrm{sim}_{ij}/\tau - b)\big)\big),$$

with $y_{ij} = \pm 1$ for matched/mismatched pairs and $b$ a learnable bias. The sigmoid gradient on a mismatched pair vanishes once $\mathrm{sim}_{ij}/\tau$ sits a few units below $b$. The loss does not push negatives toward $-1$; it requires only that they fall below the bias.
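
A sketch of the per-pair gradient, to make “vanishes below the bias” concrete. The values of $\tau$ and $b$ here are placeholders for illustration, not SigLIP’s learned parameters:

```python
import numpy as np

def siglip_pair_grad(sim, y, tau=0.05, b=0.0):
    """d/d(sim) of log(1 + exp(-y * (sim/tau - b))) for one pair:
    -y * sigmoid(-y * (sim/tau - b)) / tau."""
    m = -y * (sim / tau - b)
    return -y / tau * (1.0 / (1.0 + np.exp(-m)))

# Mismatched pair (y = -1): gradient magnitude as the similarity drops below the bias.
for s in [0.3, 0.1, 0.0, -0.1, -0.3, -0.5]:
    print(f"sim={s:+.1f}  |grad|={abs(siglip_pair_grad(s, y=-1)):.2e}")
# The magnitude decays exponentially once sim/tau sits a few units below b;
# there is no remaining pressure to drag the pair toward cos = -1.
```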

That single change aligns the objective with the geometry. Random embeddings on $\mathbb{S}^{d-1}$ already concentrate around cosine zero, exactly where SigLIP is willing to leave them. The loss isn’t fighting the spherical geometry to drag every negative across the equator. The pairwise structure eliminates the $N \times N$ softmax competition. The result is better zero-shot accuracy with smaller batches and less compute. The headline framing is “sigmoid beats softmax,” but the structural framing is sharper: targeting orthogonality lets the network use the geometry it is operating on instead of fighting against it.

Cross-entropy has always targeted orthogonality

The cross-entropy loss has been doing the right thing the whole time. For discrete distributions $p, q$,

$$H(p, q) = -\sum_x p(x) \log q(x),$$

and the key fact is its singularity structure: if $\mathrm{supp}(p) \cap \mathrm{supp}(q) = \varnothing$, then $H(p, q) = +\infty$. Disjoint supports are the probabilistic analog of orthogonal vectors: distributions that share no mass, like vectors that share no projection. The cross-entropy singularity is the orthogonality condition lifted into probability.
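
A small numerical illustration of the singularity (the helper is just for this post, not a library function):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); infinite when supp(p) escapes supp(q)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(-np.sum(p * np.log(q), where=p > 0))

p = np.array([0.5, 0.5, 0.0, 0.0])
print(cross_entropy(p, np.array([0.4, 0.6, 0.0, 0.0])))  # finite: supports overlap
print(cross_entropy(p, np.array([0.0, 0.0, 0.5, 0.5])))  # inf: disjoint supports
```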

The same structure shows up directly in classifier weights. For a softmax classifier with logits $z_k = \mathbf{w}_k^\top \mathbf{h}$ and class label $y$, the gradient on wrong-class logits $z_k$ pushes them toward $-\infty$. On the unit sphere the loss has competing pressures: $\mathbf{w}_y$ wants to be parallel to $\mathbf{h}$ to maximize $z_y$, and each $\mathbf{w}_k$ with $k \neq y$ wants to be antiparallel to $\mathbf{h}$ to minimize $z_k$. With $n$ classes sharing the same feature space, the antiparallel target cannot be reached by all $n-1$ wrong-class weights at once: each $\mathbf{w}_k$ must also serve as the correct-class direction for its own class, so the weights cannot all collapse onto $-\mathbf{h}$. The equilibrium is whatever configuration best balances those pressures subject to mutual diversity of the $\mathbf{w}_k$, and the next section shows that configuration is the regular simplex with $\langle\mathbf{w}_k, \mathbf{h}\rangle = -1/(n-1)$: approximately orthogonal for any non-trivial multi-class problem, exactly orthogonal in the $n \to \infty$ limit, and exactly antiparallel only in the binary case $n = 2$. The “well-separated representations” that plain cross-entropy classifiers produce without any contrastive auxiliary loss are not a happy accident. They are the simplex, sitting where the geometry says it should be.
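
The gradient of cross-entropy with respect to the logits makes the competing pressures explicit: it is $\mathrm{softmax}(z) - \mathrm{onehot}(y)$, negative on the true-class logit and positive on every wrong-class logit. A minimal sketch:

```python
import numpy as np

def ce_logit_grad(z, y):
    """Gradient of -log softmax(z)[y] w.r.t. the logits: softmax(z) - onehot(y)."""
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return p

z = np.array([2.0, 0.3, -0.1, 0.5])  # logits z_k = w_k . h for four classes
print(ce_logit_grad(z, y=0))
# Entry 0 is negative (z_y is pushed up); every other entry is positive,
# so each wrong-class weight is pulled toward -h until the simplex balance is reached.
```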

The simplex packing result

What is the optimal arrangement of $n$ class representations on $\mathbb{S}^{d-1}$ for $n \le d + 1$? It is the regular simplex: $n$ unit vectors with all pairwise cosines equal to

$$\cos\theta_{ij} = -\frac{1}{n - 1}.$$

For $n = 2$ this is $-1$: exactly antiparallel. The binary case is where the opposition-is-difference intuition came from, and where it is correct. The confusion starts when that intuition is generalized.

[Interactive figure: the optimal inter-class cosine $\cos\theta = -1/(n-1)$ for $n$ classes on the simplex. The curve crosses from opposition territory ($n = 2$) into orthogonality territory (the asymptote at zero) almost immediately: by $n = 3$ the optimal cosine is already only $-0.5$; by $n = 10$ it is $-0.11$; by $n = 50$, $-0.02$. Multi-class is orthogonality up to vanishing corrections. Drag the $n$ slider to see where any specific problem sits.]

For $n \ge 3$ the optimal configuration moves rapidly toward orthogonality: $n = 3$ gives $-\tfrac{1}{2}$; $n = 10$ gives $-\tfrac{1}{9} \approx -0.11$; $n = 50$ gives about $-0.02$. As $n \to \infty$, $\cos\theta \to 0$. Beyond the smallest cases the simplex is orthogonal up to a vanishing correction, and any loss that insists on $\cos\theta = -1$ for every negative is asking the geometry for something it cannot supply.
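
The simplex is also easy to construct and check directly; a minimal sketch, assuming nothing beyond NumPy: center the $n$ standard basis vectors and renormalize, and every pairwise cosine comes out at exactly $-1/(n-1)$:

```python
import numpy as np

def regular_simplex(n):
    """n unit vectors (rows) with all pairwise cosines equal to -1/(n-1)."""
    v = np.eye(n) - 1.0 / n  # center the standard basis at the origin
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for n in [2, 3, 10, 50]:
    s = regular_simplex(n)
    off_diag = (s @ s.T)[~np.eye(n, dtype=bool)]
    print(f"n={n:3d}  pairwise cos={off_diag.mean():+.4f}  -1/(n-1)={-1 / (n - 1):+.4f}")
# n=  2  pairwise cos=-1.0000  -1/(n-1)=-1.0000
# n=  3  pairwise cos=-0.5000  -1/(n-1)=-0.5000
# n= 10  pairwise cos=-0.1111  ...
```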

What changes if you get the target right

Every method that targets orthogonality (SigLIP, plain cross-entropy) shares two properties: it aligns with the natural concentration of measure on $\mathbb{S}^{d-1}$, and it achieves comparable or better accuracy with substantially less compute. Every method that targets opposition (SimCLR, CLIP, SupCon) requires enormous batches to overcome the geometric tension between its objective and the high-dimensional sphere it is operating on.

The thesis, in one sentence: the geometry of difference is not opposition; it is orthogonality. The most influential contrastive losses of the past five years spent enormous engineering effort compensating for a single geometric mistake — and the loss that did not make the mistake was sitting beside them the whole time, in the form of plain cross-entropy.

Cite as

Bouhsine, T. (2026). Opposite Is Not Different. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/opposite-is-not-different/

BibTeX
@misc{bouhsine2026oppositeisnotdifferent,
  author       = {Bouhsine, Taha},
  title        = {Opposite Is Not Different},
  year         = {2026},
  month        = {feb},
  howpublished = {\url{https://tahabouhsine.com/blog/opposite-is-not-different/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper

Bouhsine, T. (2026). Opposite ≠ Different: The Orthogonality Thesis. Unpublished manuscript. [PDF]

BibTeX
@unpublished{bouhsine2026opposite,
  author = {Bouhsine, T.},
  title  = {Opposite ≠ Different: The Orthogonality Thesis},
  year   = {2026},
  note   = {Unpublished manuscript}
}