Why Attention Needs Q and K Projections

June 4, 2026 · 16 min read

#ml #attention #transformers #self-attention #kernels #bilinear #query-key #interpretability

Part 4 of 6 · Attention Is a Kernel

1. Attention is Explainable Because it is a Kernel
2. What an MLP Knows, When It's a Kernel
3. Cheap Attention: Linear-Time Kernel Approximation
4. Why Attention Needs Q and K Projections (you are here)
5. The Kernel Between the Roles
6. The Geometry of Attention Is a Choice of Kernel

Runnable JAX companion: Q and K Projections in JAX/Flax NNX. Prefer to read the code? This post has a hands-on JAX / Flax NNX implementation.

Somewhere inside a trained transformer, one attention head is hunting for the previous occurrence of the token it is sitting on. Another wants the opening parenthesis that matches this closing one. Another wants the subject of this verb, or the name whose value should be copied forward. Each head wants one relation, a directed question with a specific kind of answer, not a general feeling of similarity.

Yet the score that has to express all of this is usually introduced as a bare dot product:

$$s_{ij}=q_i^\top k_j.$$

That formula is so compact that it hides the real design choice. The model does not take a dot product between the residual stream vectors directly. It first makes two different views of each token,

$$q_i = W_Q^\top x_i,\qquad k_j = W_K^\top x_j,$$

and then compares those views. Substitute the projections back into the score:

$$s_{ij}=x_i^\top W_Q W_K^\top x_j.$$

This is the point. Query and key projections turn a plain dot product into a learned bilinear form. The matrix

$$B = W_Q W_K^\top$$

is the relation the head uses to decide who should listen to whom.
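The substitution is easy to check numerically. The sketch below is a NumPy illustration, not the post's JAX companion: the sizes, the random weights, and the names `scores_qk` and `scores_bilinear` are all assumptions chosen for the example. It computes the scores both ways, once by projecting and taking dot products, once by applying $B$ directly to the residual vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 16, 4, 5          # illustrative sizes, not from the post

X = rng.normal(size=(n_tokens, d_model))   # one residual vector x_i per row
W_Q = rng.normal(size=(d_model, d_k))      # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))

# Route 1: project first (q_i = W_Q^T x_i, k_j = W_K^T x_j), then dot products.
Q = X @ W_Q
K = X @ W_K
scores_qk = Q @ K.T

# Route 2: apply the bilinear form B = W_Q W_K^T to the residual vectors directly.
B = W_Q @ W_K.T
scores_bilinear = X @ B @ X.T

print(np.allclose(scores_qk, scores_bilinear))  # True
```

Both routes give the same score matrix, so everything a head can express through its scores is contained in $B$.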

The raw dot product is too honest

If we removed $W_Q$ and $W_K$, attention would score tokens by

$$s_{ij}=x_i^\top x_j.$$

That is a fixed similarity in the residual stream’s native coordinates. It asks: are these two token vectors already aligned?

Sometimes that is useful, but it is too literal. The residual stream is a shared workspace. It contains lexical identity, position, syntax, local features, long-range features, partial predictions, and debris from other heads and MLPs. A head usually does not want “everything similar to me.” It wants one of the relations from the opening: the previous occurrence, the matching delimiter, the value attached to this key, the source token that should be copied.

A raw dot product cannot choose that relation inside the head. It inherits whatever geometry earlier layers happened to put into the residual stream.

A shared projection learns a metric

Can one learned map fix that? The obvious repair is to project both sides the same way:

$$s_{ij} = (W^\top x_i)^\top(W^\top x_j) = x_i^\top W W^\top x_j.$$

Now the head has learned a metric. It can mute the residual directions it does not care about, amplify the subspace it does, and measure similarity in coordinates of its own choosing instead of whatever geometry earlier layers left behind. For a moment this looks like everything a head could need: a private, learned notion of “relevant,” one per head.

Try to spend it on a real relation, though, and it snaps. Take the induction relation, “attend to the position whose predecessor is my token.” When the sequence reads a b … a, the final a should score highly against b, the token that followed the earlier a. But b should not score highly against a in return; b’s own question is about positions whose predecessor is b. No matter what $W$ learns, a shared projection cannot express that, because its score is symmetric by construction:

$$s_{ij}=s_{ji}.$$

The same representation is used for asking and answering, so token $i$ scores against token $j$ exactly as $j$ scores against $i$. “Ask for my identity, answer with your predecessor” is a directed relation, and a symmetric score cannot point.

That is a claim about the score, not the whole layer. The full attention operator can still be directional: the softmax normalizes per row, causal masking only looks backward, and values are written through a separate OV map. But the score a shared projection computes is symmetric, and a head usually needs the score itself to be directed.
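A small numerical check makes the symmetry claim concrete. This is a sketch with random matrices standing in for learned weights; the sizes and the helper `is_symmetric` are assumptions for illustration. It tries several shared projections $W$ and confirms that none of them can break the symmetry of the score matrix, and that $WW^\top$ has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k, n_tokens = 16, 4, 6          # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))   # residual vectors, one per row

def is_symmetric(S):
    """True when S[i, j] == S[j, i] up to floating-point tolerance."""
    return bool(np.allclose(S, S.T))

# No projection: plain dot products in residual coordinates.
raw_scores = X @ X.T

# Shared projection: try several random W; none of them can break the symmetry.
shared_symmetric = []
min_eigenvalues = []
for _ in range(5):
    W = rng.normal(size=(d_model, d_k))
    P = X @ W                               # the same view is used to ask and to answer
    shared_symmetric.append(is_symmetric(P @ P.T))
    # W W^T is positive semidefinite: its eigenvalues are >= 0 up to round-off.
    min_eigenvalues.append(np.linalg.eigvalsh(W @ W.T).min())

print(is_symmetric(raw_scores))       # True
print(all(shared_symmetric))          # True
print(min(min_eigenvalues) > -1e-9)   # True
```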

Separate Q and K create roles

The way out is to stop making one map do two jobs. With separate projections,

$$s_{ij}=x_i^\top W_Q W_K^\top x_j,$$

the left side of the bilinear form is the question a token asks, and the right side is the address a token advertises. Those are not the same job.

For example:

a closing parenthesis asks for an opening parenthesis;
a pronoun asks for an antecedent;
a repeated token asks for the token after its previous occurrence;
a name-mover head asks for the token whose value should be copied forward.

In each case, the querying token and the source token can live in different feature coordinates. The query projection extracts “what I need.” The key projection extracts “what I provide.” The dot product then tests compatibility between those two roles.

This is not a new idea. The bilinear score predates transformers: it is Luong et al.’s (2015) “general” (multiplicative) attention, $h_t^\top W h_s$, and Dozat & Manning’s (2017) biaffine parser, which already used separate projections for the head role and the dependent role, the same query/key split, named differently. What transformers added was doing it many times in parallel, once per head.

And the examples above are not hypothetical. They are documented circuits. The “repeated token asks for the token after its previous occurrence” relation is the induction head (Olsson et al., 2022), built from a previous-token head feeding a head whose query reads the current token and whose key reads each position’s predecessor. The “name-mover” relation is the IOI circuit of Wang et al. (2022). Both are directed: the query role and the key role do different jobs, which is exactly what a non-symmetric $B$ allows.

This is why the projections are not cosmetic. They are what make the score a relation rather than a similarity.

The symmetric and antisymmetric parts

So what exactly did separate $Q$ and $K$ buy over the shared projection? Linear algebra gives a surgical answer. Any matrix splits into a symmetric and an antisymmetric piece,

$$B = \underbrace{\tfrac12\left(B+B^\top\right)}_{S}+\underbrace{\tfrac12\left(B-B^\top\right)}_{A},$$

so the score splits too:

$$s_{ij}=x_i^\top S\,x_j + x_i^\top A\,x_j.$$

The symmetric part $S$ is a signed metric; the antisymmetric part $A$ is pure directedness, with $x_i^\top A\,x_j = -\,x_j^\top A\,x_i$, so it contributes opposite amounts to $s_{ij}$ and $s_{ji}$. This sharpens what separate $Q$ and $K$ actually add. A shared projection can only produce $WW^\top$, which is symmetric and positive semidefinite. Separate $Q$ and $K$ relax that in two independent ways at once. They unlock the antisymmetric part $A$, which is the entire source of the logit-level asymmetry that makes “A asks for B” different from “B asks for A” before masking or row-normalization enter. And they let the symmetric part $S$ itself be indefinite, free to score some aligned directions as incompatible. Directedness and indefinite compatibility are both out of reach for a shared projection. The low-rank bilinear form buys both.

The split also explains a subtlety in the kernel view below. The self-score, and more generally any quadratic form $x^\top B\,x$, sees only the symmetric part: $x^\top B\,x = x^\top S\,x$, because $x^\top A\,x = 0$ for every $x$. So a head’s directedness is completely invisible if you only look at how a token scores against itself. It lives entirely off-diagonal, in $A$.
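The decomposition can be verified in a few lines. The sketch below uses random $W_Q$ and $W_K$ as stand-ins for a trained head (an assumption; real heads will have their own balance between $S$ and $A$) and checks the three claims above: the score splits into a symmetric and an antisymmetric contribution, the antisymmetric one flips sign when the roles swap, and it vanishes on the self-score.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k = 16, 4                      # illustrative sizes
W_Q = rng.normal(size=(d_model, d_k))     # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
B = W_Q @ W_K.T

S = 0.5 * (B + B.T)                       # symmetric part
A = 0.5 * (B - B.T)                       # antisymmetric part

x_i = rng.normal(size=d_model)
x_j = rng.normal(size=d_model)

s_ij = x_i @ B @ x_j                      # i asks, j answers
s_ji = x_j @ B @ x_i                      # j asks, i answers
sym = x_i @ S @ x_j
anti = x_i @ A @ x_j

print(np.isclose(s_ij, sym + anti))       # True: the score splits
print(np.isclose(s_ji, sym - anti))       # True: A flips sign when roles swap
print(np.isclose(x_i @ A @ x_i, 0.0))     # True: A is invisible on the self-score

# S is indefinite here: it has both negative and positive eigenvalues.
eigs = np.linalg.eigvalsh(S)
print(eigs.min() < 0 < eigs.max())        # True
```

The gap between $s_{ij}$ and $s_{ji}$ is exactly twice the antisymmetric term, which is one way to measure how directed a given head's score is for a pair of tokens.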

Bilinearity is the mechanism

Why is such a simple form enough to carry a head’s routing? Because a bilinear score is linear in each argument separately. Fix the source token and it is linear in the asker; fix the asker and it is linear in the source:

$$s(x,y)=x^\top B y.$$

That separateness gives a head three useful properties.

First, roles. The two arguments are not interchangeable unless $B$ is symmetric, so attention can learn “A asks for B” without also learning “B asks for A.”

Second, composition. A token carrying several features does not get one monolithic score; the bilinear form sums exactly the pairwise feature interactions that $B$ selects, which lets one head score a particular cross-feature relation instead of a whole-token similarity.

And third, a built-in budget. Since $B=W_QW_K^\top$, its rank is at most the head dimension $d_k$, so the head cannot learn every possible interaction in the residual stream. It learns a compressed relation: a small set of query features matched against a small set of key features.

That last point is often treated as an efficiency detail, but it is also an inductive bias. Each head gets a limited relation budget. Multi-head attention works because different heads spend that budget on different bilinear relations.
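Linearity in each argument, the pairwise sum, and the rank budget can all be checked directly. In this sketch the feature directions `f1`, `f2`, `g1`, `g2` are hypothetical random vectors, not features recovered from any model, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_k = 32, 4                      # illustrative sizes
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
B = W_Q @ W_K.T

def score(x, y):
    """Bilinear score s(x, y) = x^T B y."""
    return x @ B @ y

# Hypothetical feature directions: two on the asking side, two on the answering side.
f1, f2, g1, g2 = rng.normal(size=(4, d_model))
a, b = 2.0, -0.5

# Linear in the first argument when the second is held fixed.
lhs = score(a * f1 + b * f2, g1)
rhs = a * score(f1, g1) + b * score(f2, g1)
print(np.isclose(lhs, rhs))               # True

# A token carrying several features scores as the sum of pairwise interactions.
whole = score(f1 + f2, g1 + g2)
pairwise = score(f1, g1) + score(f1, g2) + score(f2, g1) + score(f2, g2)
print(np.isclose(whole, pairwise))        # True

# The budget: B is d_model x d_model, but its rank cannot exceed d_k.
rank = np.linalg.matrix_rank(B)
print(B.shape, rank)                      # (32, 32) 4
```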

What about position?

The relations in the examples (previous occurrence, matching delimiter, the token after an earlier copy) are mostly positional, and so far position is nowhere in the score. Rotary position embeddings (Su et al., 2021) put it exactly where the bilinear form lives. RoPE rotates the query and key by angles proportional to their positions, in a set of 2-D planes: $\tilde q_i = R_i\,W_Q^\top x_i$ and $\tilde k_j = R_j\,W_K^\top x_j$, where $R_m$ is block-diagonal and rotates plane $p$ by $m\theta_p$, with one frequency $\theta_p$ per plane. Their score is

$$s_{ij}=\tilde q_i^\top \tilde k_j = x_i^\top W_Q\,R_i^\top R_j\,W_K^\top x_j.$$

The product $R_i^\top R_j$ is itself a rotation, by an angle proportional to $j-i$ in each plane, so the score depends on the two positions only through their relative offset. The same query and key now score differently depending on the gap between the tokens. In many frequency mixtures this yields a distance-sensitive compatibility, with high-frequency planes decorrelating faster than low-frequency ones, though the exact profile depends on the content vectors and is not monotone in general. Position is not a separate additive signal here; it is a modulation of the relation itself.
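The relative-offset property can be checked with explicit rotation matrices. The sketch below is deliberately naive: it builds each $R_m$ as a dense block-diagonal matrix, which real implementations avoid. It also assumes the frequency schedule $\theta_p = 10000^{-2p/d_k}$ and an adjacent-pair layout of the planes. The helper names `rope_matrix` and `rope_score` are invented for this example.

```python
import numpy as np

def rope_matrix(pos, d_k, base=10000.0):
    """Dense block-diagonal rotation R_pos: plane p is rotated by pos * theta_p.

    Assumes theta_p = base ** (-2p / d_k) and that plane p uses
    coordinates (2p, 2p + 1).
    """
    theta = base ** (-2.0 * np.arange(d_k // 2) / d_k)
    R = np.zeros((d_k, d_k))
    for p, t in enumerate(theta):
        c, s = np.cos(pos * t), np.sin(pos * t)
        R[2 * p:2 * p + 2, 2 * p:2 * p + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(4)
d_k = 8
q = rng.normal(size=d_k)    # stands in for W_Q^T x_i
k = rng.normal(size=d_k)    # stands in for W_K^T x_j

def rope_score(i, j):
    """Score between the query placed at position i and the key at position j."""
    return (rope_matrix(i, d_k) @ q) @ (rope_matrix(j, d_k) @ k)

# R_i^T R_j depends only on the offset j - i.
relative = np.allclose(rope_matrix(3, d_k).T @ rope_matrix(7, d_k),
                       rope_matrix(4, d_k))

same_offset = np.isclose(rope_score(3, 7), rope_score(10, 14))      # offset 4 twice
different_offset = np.isclose(rope_score(3, 7), rope_score(3, 8))   # offsets 4 and 5

print(relative, same_offset, different_offset)
```

Shifting both tokens by the same amount leaves the score unchanged, while changing the gap between them changes it, with the content vectors `q` and `k` held fixed throughout.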

The kernel view

Once the score is written as

$$s_{ij}=x_i^\top B x_j,$$

the kernel interpretation becomes clearer. Attention is not using one universal token similarity. Each head learns its own kernel-like compatibility:

$$K_h(x_i,x_j)=\exp(x_i^\top B_h x_j).$$

If $B_h$ were symmetric positive semidefinite, this would be a standard learned inner-product kernel after projection. With separate $Q$ and $K$, $B_h$ need not be symmetric or positive semidefinite. That is the price of directionality and the reason attention is a kernel smoother in the operational sense rather than always a Mercer kernel in the strict sense.

The softmax then turns this compatibility into contribution mass, so the full head reads:

$$y_i=\sum_j \operatorname{softmax}_j(x_i^\top B x_j)\,v_j.$$

The bilinear form chooses the relation. The softmax normalizes the relation. The values carry the content.
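Those three steps fit in a few lines. The sketch below is a minimal single head in NumPy that mirrors the displayed formula literally, so it leaves out the $1/\sqrt{d_k}$ scaling, the causal mask, and the output projection. The weights are random, and `W_V` and the `softmax` helper are names introduced here for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(8)
d_model, d_k, n_tokens = 16, 4, 5               # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
# Small random weights so the softmax rows stay away from saturation.
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

B = W_Q @ W_K.T
scores = X @ B @ X.T                  # the bilinear form chooses the relation
weights = softmax(scores, axis=-1)    # the softmax normalizes each row over j
V = X @ W_V                           # the values carry the content
Y = weights @ V                       # y_i = sum_j weights[i, j] * v_j

print(weights.sum(axis=-1))           # each row sums to 1
print(Y.shape)                        # (5, 4)
```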

Two details complete the picture. First, the score is divided by $\sqrt{d_k}$ before the softmax. That scaling is not cosmetic either: the score $q_i^\top k_j$ is a sum of $d_k$ products, so with roughly unit-variance query and key components its standard deviation grows like $\sqrt{d_k}$ (Vaswani et al., 2017). Without the $1/\sqrt{d_k}$ correction the scores would land in the softmax’s saturated region at initialization, where each row puts almost all of its weight on a single token and gradients vanish. The factor keeps the relation learnable. Second, the values have their own factorization, $v_j=W_O W_V^\top x_j$, the OV circuit, the dual of the QK circuit (Elhage et al., 2021). The division of labor is clean: the QK circuit decides where a token reads from, and the OV circuit decides what gets written when it does. This post is about the first; the second is the same bilinear story told about content instead of routing.
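The effect of the scaling is easy to see empirically. The sketch below assumes query and key components drawn independently from a standard normal, which is an idealization of initialization rather than a measurement from a real model. For several head dimensions it records the spread of the scores and the average largest attention weight per row, with and without the $1/\sqrt{d_k}$ factor.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
n_queries, n_keys = 500, 32               # illustrative sizes
stats = {}
for d_k in (16, 64, 256):
    q = rng.normal(size=(n_queries, d_k))     # unit-variance query components
    k = rng.normal(size=(n_keys, d_k))        # unit-variance key components
    raw = q @ k.T                             # unscaled scores
    scaled = raw / np.sqrt(d_k)               # scaled scores
    stats[d_k] = {
        "raw_std": raw.std(),                 # grows roughly like sqrt(d_k)
        "scaled_std": scaled.std(),           # stays near 1
        "raw_peak": softmax(raw).max(axis=-1).mean(),        # mean top weight per row
        "scaled_peak": softmax(scaled).max(axis=-1).mean(),
    }

for d_k, s in stats.items():
    print(d_k, {name: round(float(v), 3) for name, v in s.items()})
```

Under these assumptions the unscaled standard deviation tracks $\sqrt{d_k}$ and the unscaled rows are much more sharply peaked than the scaled ones, which is the saturation the scaling is meant to prevent.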

Why not just learn B directly?

One could write $s_{ij}=x_i^\top Bx_j$ and learn $B$ as a full $d_{\text{model}}\times d_{\text{model}}$ matrix. Transformers factor it instead:

$$B=W_QW_K^\top.$$

This does three things at once.

First, it reduces parameters. A full $B$ costs $d_{\text{model}}^2$ parameters per head. The factorized form costs about $2d_{\text{model}}d_k$, and $d_k$ is much smaller than $d_{\text{model}}$.

Second, it reduces computation. The model can compute all queries and keys once, then form $QK^\top$ . The factorization matches the matrix multiplication that hardware is good at.

Third, it makes the relation legible. Query features and key features are separable objects. In circuit terms, you can ask what directions write into $Q$, what directions write into $K$, and what pairings between them produce a head’s pattern.

That legibility has one limit. Only the product $B=W_QW_K^\top$ affects the scores; the split into $Q$ and $K$ is not unique. For any invertible $d_k\times d_k$ matrix $M$, replacing $(W_Q, W_K)$ with $(W_Q M,\,W_K M^{-\top})$ leaves every score unchanged, a gauge freedom in the factorization. So an individual query coordinate has no canonical meaning; what is identified is the relation $B$ and the query/key subspaces it pairs, not a privileged basis inside them. It is the same non-identifiability that attends any factored representation: the factors are a chosen coordinatization of a product that is the only observable object.
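Both the gauge freedom and the parameter saving can be checked in a few lines. The sketch below assumes random weights and one particular invertible $M$ (unit upper-triangular, chosen only so that invertibility is guaranteed). The sizes $d_{\text{model}}=768$ and $d_k=64$ in the parameter count are illustrative, picked to match a small GPT-2-sized head.

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_k = 16, 4                      # illustrative sizes
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

# Any invertible d_k x d_k matrix works. A unit upper-triangular M is used
# here only because its determinant is 1, so invertibility is guaranteed.
M = np.eye(d_k) + np.triu(rng.normal(size=(d_k, d_k)), k=1)

W_Q_new = W_Q @ M                         # W_Q M
W_K_new = W_K @ np.linalg.inv(M).T        # W_K M^{-T}

same_relation = np.allclose(W_Q @ W_K.T, W_Q_new @ W_K_new.T)
same_factors = np.allclose(W_Q, W_Q_new)
print(same_relation, same_factors)        # True False

# Parameter count for the score of one head, with illustrative sizes.
d_model_big, d_k_big = 768, 64
full_params = d_model_big ** 2                 # learning B directly
factored_params = 2 * d_model_big * d_k_big    # learning W_Q and W_K
print(full_params, factored_params)       # 589824 98304
```

The relation $B$ is identical before and after the change of basis even though every column of $W_Q$ has changed, which is why a single query coordinate should not be read as a feature on its own.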

The factorization is not merely a trick for speed. It is the head’s relational vocabulary.

What Q and K buy

Without projections, attention says:

attend to tokens that are already similar to me.

With one shared projection, it says:

attend to tokens similar to me in this learned subspace.

With separate query and key projections, it says:

attend to tokens whose advertised features answer my requested features.

That is the whole reason $Q$ and $K$ exist. The dot product supplies a cheap compatibility operation. The projections decide what compatibility means.

The attention mechanism is therefore not “dot product similarity” in the ordinary sense. It is a learned bilinear relation, factorized into a query role and a key role, then normalized into a distribution over values.

What if we solved the bilinearity problem with a nonlinear kernel?

A bilinear form is the simplest object that can express a directed relation, and that simplicity is also its ceiling. $s(x,y)=x^\top B y$ is linear in each argument separately. It can only score relations that decompose into pairwise feature products selected by $B$. Whatever compatibility a head needs that is not a sum of such products, the bilinear form cannot represent; it has to be approximated by stacking heads and layers around it.

The Q/K factorization makes this concrete: $B=W_QW_K^\top$ is at most rank $d_k$, and even at full rank it is still bilinear. We bought directionality by giving up symmetry and positive semidefiniteness, which is exactly why $\exp(x_i^\top B\,x_j)$ is a kernel smoother only in the operational sense and not a Mercer kernel in the strict one.

So the closing question is the obvious one, but it has a trap in it. The bilinear score is a stand-in for “how compatible are these two tokens?”, and that is precisely the question a kernel is built to answer. The tempting move is to replace it with a genuine nonlinear kernel $s_{ij}=\kappa(x_i,x_j)$, positive definite by construction. But a Mercer kernel on the raw residual vectors is symmetric, $\kappa(x_i,x_j)=\kappa(x_j,x_i)$, so it throws away the very thing this whole post defended. You would gain a real geometry and a richer-than-bilinear comparison, and pay for it by collapsing attention back into symmetric similarity: no query role, no key role.

The sharper move keeps the roles and kernelizes the compatibility between them:

$$s_{ij}=\kappa\big(f_Q(x_i),\,f_K(x_j)\big) \qquad\text{or}\qquad s_{ij}=\big\langle \Phi_Q(x_i),\,\Phi_K(x_j)\big\rangle.$$

The query map $f_Q$ still extracts what a token is asking for; the key map $f_K$ still extracts what it advertises. Only the comparison between them changes: it is no longer capped at a low-rank bilinear form, but is a nonlinear, learnable, deliberately role-asymmetric compatibility. That is the real kernel version of attention: not a similarity kernel on tokens, but a compatibility kernel between roles. And it is the question these posts have been heading toward: can the QK relation be made nonlinear and richer while keeping the routing readable and cheap? That construction deserves a post of its own, and it will get one later in this series.
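To make the contrast tangible, here is one toy instance of the first form. It is a sketch and not the construction the series goes on to develop. Everything specific in it is an assumption made for illustration: the Gaussian (RBF) choice for $\kappa$, the `tanh` role maps `f_Q` and `f_K`, the bandwidth `sigma`, and the random weights. It compares a symmetric kernel on raw tokens with the same kernel applied between roles.

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_k, n_tokens = 16, 4, 6                         # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)  # hypothetical role weights
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

def f_Q(x):
    """Hypothetical query role map: what a token asks for."""
    return np.tanh(x @ W_Q)

def f_K(x):
    """Hypothetical key role map: what a token advertises."""
    return np.tanh(x @ W_K)

def rbf(u, v, sigma=1.0):
    """Gaussian kernel between every row of u and every row of v."""
    sq_dist = ((u[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Kernel on raw tokens: symmetric, so it has no query role and no key role.
token_kernel = rbf(X, X)

# Kernel between roles: scores[i, j] compares what i asks with what j advertises.
role_scores = rbf(f_Q(X), f_K(X))

print(np.allclose(token_kernel, token_kernel.T))   # True
print(np.allclose(role_scores, role_scores.T))     # False
```

The role version keeps the directedness of the bilinear score, since swapping $i$ and $j$ changes which token is asking, while the comparison itself is no longer a sum of pairwise feature products.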

Cite as

Bouhsine, T. (2026, June 4). Why Attention Needs Q and K Projections. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/why-attention-needs-qk-projections/

BibTeX

@misc{bouhsine2026whyattentionneedsqkprojections,
  author       = {Bouhsine, Taha},
  title        = {Why Attention Needs Q and K Projections},
  year         = {2026},
  month        = {jun},
  howpublished = {\url{https://tahabouhsine.com/blog/why-attention-needs-qk-projections/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

References

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
Tsai, Y.-H. H., et al. (2019). Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel. EMNLP-IJCNLP 2019.
Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
Luong, M.-T., Pham, H., Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015. arXiv:1508.04025
Dozat, T., Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017. arXiv:1611.01734
Olsson, C., et al. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread. arXiv:2209.11895
Wang, K., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. ICLR 2023. arXiv:2211.00593
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint. arXiv:2104.09864
