Why Attention Needs Q and K Projections
#ml #attention #transformers #self-attention #kernels #bilinear #query-key #interpretability
The attention score is usually introduced as a dot product:
$$s_{ij} = q_i^\top k_j$$
That formula is so compact that it hides the real design choice. The model does not take a dot product between the residual stream vectors directly. It first makes two different views of each token,
$$q_i = W_Q x_i, \qquad k_j = W_K x_j,$$
and then compares those views. Substitute the projections back into the score:
$$s_{ij} = (W_Q x_i)^\top (W_K x_j) = x_i^\top W_Q^\top W_K \, x_j$$
This is the point. Query and key projections turn a plain dot product into a learned bilinear form. The matrix
$$B = W_Q^\top W_K$$
is the relation the head uses to decide who should listen to whom.
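The substitution can be checked numerically. The sketch below is illustrative only: the sizes, the random weights, and the names `X`, `W_Q`, `W_K`, `B` are assumptions chosen for the example, with one residual vector per row of `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k, n = 16, 4, 5              # residual width, head width, tokens (illustrative)

X = rng.normal(size=(n, d))       # one residual-stream vector per row
W_Q = rng.normal(size=(d_k, d))   # query projection
W_K = rng.normal(size=(d_k, d))   # key projection

# Route 1: project first, then take dot products between queries and keys.
Q = X @ W_Q.T                     # (n, d_k)
K = X @ W_K.T                     # (n, d_k)
scores_qk = Q @ K.T               # scores_qk[i, j] = q_i . k_j

# Route 2: fold both projections into a single bilinear form B.
B = W_Q.T @ W_K                   # (d, d)
scores_bilinear = X @ B @ X.T     # scores_bilinear[i, j] = x_i^T B x_j

print(np.allclose(scores_qk, scores_bilinear))
```

Both routes give the same score matrix, so anything said about $B$ is a statement about what the head's logits can express.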
The raw dot product is too honest
If we removed $W_Q$ and $W_K$, attention would score tokens by
$$s_{ij} = x_i^\top x_j$$
That is a fixed similarity in the residual stream’s native coordinates. It asks: are these two token vectors already aligned?
Sometimes that is useful, but it is too literal. The residual stream is a shared workspace. It contains lexical identity, position, syntax, local features, long-range features, partial predictions, and debris from other heads and MLPs. A head usually does not want “everything similar to me.” It wants one relation: previous occurrence, matching delimiter, subject for this verb, value attached to this key, source token that should be copied.
A raw dot product cannot choose that relation inside the head. It inherits whatever geometry earlier layers happened to put into the residual stream.
A shared projection learns a metric
The first improvement is to project both sides the same way:
$$s_{ij} = (W x_i)^\top (W x_j) = x_i^\top W^\top W \, x_j$$
Now the head has learned a metric. It can ignore irrelevant residual directions and amplify the subspace it cares about. This is already much better than the raw dot product.
But it is still symmetric:
$$s_{ij} = x_i^\top W^\top W \, x_j = x_j^\top W^\top W \, x_i = s_{ji}$$
The same representation is used for asking and answering: token $i$ scores against token $j$ exactly as $j$ scores against $i$. The compatibility score cannot be role-asymmetric.
That is a claim about the score, not the whole layer. The full attention operator can still be directional — the softmax normalizes per row, causal masking only looks backward, and values are written through a separate OV map. But the score a shared projection computes is symmetric, and a head usually needs the score itself to be directed.
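A small numerical sketch of that distinction, under assumed sizes and random weights (the names `W`, `scores`, and `weights` are illustrative): the logits from a shared projection are symmetric and $W^\top W$ is positive semidefinite, yet the row-normalized attention weights are already not symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k, n = 16, 4, 5
X = rng.normal(size=(n, d))
W = rng.normal(size=(d_k, d))       # one projection shared by both roles

Z = X @ W.T
scores = Z @ Z.T                    # scores[i, j] = (W x_i) . (W x_j)

# The logits are symmetric: i scores j exactly as j scores i.
logits_symmetric = np.allclose(scores, scores.T)

# W^T W is positive semidefinite: eigenvalues are >= 0 up to round-off.
eigvals = np.linalg.eigvalsh(W.T @ W)
is_psd = eigvals.min() > -1e-9

# Row-wise softmax normalizes each query separately, so the attention
# weights differ by direction even though the logits do not.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
weights_symmetric = np.allclose(weights, weights.T)

print(logits_symmetric, is_psd, weights_symmetric)
```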
Separate Q and K create roles
With separate projections,
$$s_{ij} = (W_Q x_i)^\top (W_K x_j) = x_i^\top B \, x_j, \qquad B = W_Q^\top W_K$$
The left side of the bilinear form is the question a token asks. The right side is the address a token advertises. Those are not the same job.
For example:
- a closing parenthesis asks for an opening parenthesis;
- a pronoun asks for an antecedent;
- a repeated token asks for the token after its previous occurrence;
- a name-mover head asks for the token whose value should be copied forward.
In each case, the querying token and the source token can live in different feature coordinates. The query projection extracts “what I need.” The key projection extracts “what I provide.” The dot product then tests compatibility between those two roles.
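A toy version of the parenthesis example makes the two roles concrete. Everything here is hypothetical: a three-feature residual basis and hand-written one-row projections, chosen only to show a query that reads one feature and a key that advertises another.

```python
import numpy as np

# Hypothetical residual basis: [is_open_paren, is_close_paren, other]
open_tok = np.array([1.0, 0.0, 0.0])
close_tok = np.array([0.0, 1.0, 0.0])

W_Q = np.array([[0.0, 1.0, 0.0]])   # "what I need": fires on a closing parenthesis
W_K = np.array([[1.0, 0.0, 0.0]])   # "what I provide": fires on an opening parenthesis

def score(x_query, x_key):
    """Compatibility between the query role of one token and the key role of another."""
    return float((W_Q @ x_query) @ (W_K @ x_key))

close_asks_open = score(close_tok, open_tok)   # 1.0: the closer finds the opener
open_asks_close = score(open_tok, close_tok)   # 0.0: the reverse relation is not learned
raw_similarity = float(close_tok @ open_tok)   # 0.0: the raw dot product sees nothing

print(close_asks_open, open_asks_close, raw_similarity)
```

The two tokens are orthogonal in the residual basis, so no similarity-based score could link them; the directed score links them in one direction only.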
This is not a new idea. The bilinear score predates transformers: it is Luong et al.’s (2015) “general” (multiplicative) attention, $\mathrm{score}(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s$, and Dozat & Manning’s (2017) biaffine parser, which already used separate projections for the head role and the dependent role: the same query/key split, named differently. What transformers added was doing it many times in parallel, once per head.
And the examples above are not hypothetical. They are documented circuits. The “repeated token asks for the token after its previous occurrence” relation is the induction head (Olsson et al., 2022), built from a previous-token head feeding a head whose query reads the current token and whose key reads each position’s predecessor. The “name-mover” relation is the IOI circuit of Wang et al. (2022). Both are directed: the query role and the key role do different jobs, which is exactly what a non-symmetric $B$ allows.
This is why the projections are not cosmetic. They are what make the score a relation rather than a similarity.
The symmetric and antisymmetric parts
There is a clean way to see exactly what separate $W_Q$ and $W_K$ add over a shared projection. Any matrix splits into a symmetric and an antisymmetric piece,
$$B = S + A, \qquad S = \tfrac{1}{2}\,(B + B^\top), \qquad A = \tfrac{1}{2}\,(B - B^\top),$$
so the score splits too:
$$x_i^\top B \, x_j = x_i^\top S \, x_j + x_i^\top A \, x_j$$
The symmetric part $S$ is a signed metric; the antisymmetric part $A$ is pure directedness, with $x_i^\top A \, x_j = -\,x_j^\top A \, x_i$, so it contributes the opposite amount to $s_{ij}$ and $s_{ji}$. This sharpens what separate $W_Q$ and $W_K$ actually add. A shared projection can only produce $B = W^\top W$, which is symmetric and positive semidefinite. Separate $W_Q$ and $W_K$ relax that in two independent ways at once: they unlock the antisymmetric part $A$, which is the entire source of the logit-level asymmetry that makes “A asks for B” different from “B asks for A” before masking or row-normalization enter; and they let the symmetric part $S$ itself be indefinite, free to score some aligned directions as incompatible. Directedness and indefinite compatibility are both out of reach for a shared projection. The low-rank bilinear form buys both.
The split also explains a subtlety in the kernel view below. The self-score $s_{ii}$ and, more generally, the quadratic form $x^\top B \, x$ see only the symmetric part: $x^\top B \, x = x^\top S \, x$, because the antisymmetric part vanishes on the diagonal ($x^\top A \, x = 0$ for every $x$). So a head’s directedness is completely invisible if you only look at how a token scores against itself. It lives entirely off-diagonal, in $x_i^\top A \, x_j$ with $i \neq j$.
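The decomposition is easy to verify on a random head. This is a sketch under assumed sizes and random weights; `S` and `A` follow the symmetric and antisymmetric split of $B = W_Q^\top W_K$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 16, 4
W_Q = rng.normal(size=(d_k, d))
W_K = rng.normal(size=(d_k, d))
B = W_Q.T @ W_K

S = 0.5 * (B + B.T)      # symmetric part
A = 0.5 * (B - B.T)      # antisymmetric part

x_i, x_j = rng.normal(size=d), rng.normal(size=d)
s_ij = x_i @ B @ x_j
s_ji = x_j @ B @ x_i

# The antisymmetric part carries exactly half of the logit gap s_ij - s_ji.
directed_part = x_i @ A @ x_j

# A token scoring against itself only ever sees the symmetric part.
self_score_full = x_i @ B @ x_i
self_score_sym = x_i @ S @ x_i

# With separate projections the symmetric part is generally indefinite.
eig_S = np.linalg.eigvalsh(S)

print(np.isclose(directed_part, 0.5 * (s_ij - s_ji)))
print(np.isclose(self_score_full, self_score_sym))
print(eig_S.min(), eig_S.max())
```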
Bilinearity is the mechanism
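A bilinear score is linear in the querying token when the source token is fixed, and linear in the source token when the query is fixed:
$$s(a\,x + b\,x',\; y) = a\,s(x, y) + b\,s(x', y), \qquad s(x,\; a\,y + b\,y') = a\,s(x, y) + b\,s(x, y'),$$
where $s(x, y) = x^\top B \, y$.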
That gives a head three useful properties.
It is role-aware. The two arguments are not interchangeable unless $B$ is symmetric. Attention can learn “A asks for B” without also learning “B asks for A.”
It is compositional. If a token contains several features, the score is the sum of pairwise feature interactions selected by $B$. This lets one head score a particular cross-feature relation instead of a whole-token similarity.
It is low-rank by construction. Since $B = W_Q^\top W_K$, its rank is at most the head dimension $d_k$. The head does not learn every possible interaction in the residual stream. It learns a compressed relation: a small set of query features matched against a small set of key features.
That last point is often treated as an efficiency detail, but it is also an inductive bias. Each head gets a limited relation budget. Multi-head attention works because different heads spend that budget on different bilinear relations.
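The linearity and rank properties can be checked directly. A minimal sketch, assuming random weights and illustrative sizes; `score` is a helper defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 16, 4
W_Q = rng.normal(size=(d_k, d))
W_K = rng.normal(size=(d_k, d))
B = W_Q.T @ W_K

def score(x, y):
    """Bilinear score x^T B y."""
    return x @ B @ y

u, v, y = rng.normal(size=(3, d))
a, b = 2.0, -0.5

# Linear in the query slot with the key fixed.
query_lhs = score(a * u + b * v, y)
query_rhs = a * score(u, y) + b * score(v, y)

# Compositional: the score is a sum of pairwise feature interactions B[p, q].
pairwise_sum = np.einsum("p,pq,q->", u, B, y)

# Low rank: a d x d relation built from d_k-dimensional queries and keys.
rank_B = np.linalg.matrix_rank(B)

print(np.isclose(query_lhs, query_rhs), np.isclose(pairwise_sum, score(u, y)), rank_B)
```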
What about position?
The relations in the examples (previous occurrence, matching delimiter, the token after an earlier copy) are mostly positional, and so far position is nowhere in the score. Rotary position embeddings (Su et al., 2021) put it exactly where the bilinear form lives. RoPE rotates the query and key by an angle proportional to their positions, in a set of 2-D planes: $\tilde{q}_i = R_i \, q_i$ and $\tilde{k}_j = R_j \, k_j$, where $R_m$ is a block-diagonal rotation by $m\,\theta_p$ in plane $p$. Their score is
$$\tilde{q}_i^\top \tilde{k}_j = q_i^\top R_i^\top R_j \, k_j = q_i^\top R_{j-i} \, k_j$$
The product $R_i^\top R_j = R_{j-i}$ is a rotation by an angle proportional to $j - i$, so the score depends on the two positions only through their relative offset. The same query and key now score differently depending on the gap between the tokens. In many frequency mixtures this yields a distance-sensitive compatibility, with high-frequency planes decorrelating faster than low-frequency ones, though the exact profile depends on the content vectors and is not monotone in general. Position is not a separate additive signal here; it is a modulation of the relation itself.
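The relative-offset property can be tested with a few lines. This sketch assumes the consecutive-pair layout of the rotation planes and the common base of 10000; `rope_rotate` and `rope_score` are illustrative helpers, not any library's API, and production implementations often pair coordinates differently.

```python
import numpy as np

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive coordinate pair of `vec` by pos * theta_p."""
    d_k = vec.shape[-1]
    theta = base ** (-np.arange(0, d_k, 2) / d_k)   # one frequency per 2-D plane
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    even, odd = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

def rope_score(i, j):
    """Score between the same content vectors placed at positions i and j."""
    return rope_rotate(q, i) @ rope_rotate(k, j)

# Same offset (4) at different absolute positions: identical score.
same_offset = np.isclose(rope_score(3, 7), rope_score(10, 14))
# Different offset (4 vs 6) with the same content: the score changes.
different_offset = np.isclose(rope_score(3, 7), rope_score(3, 9))

print(same_offset, different_offset)
```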
The kernel view
Once the score is written as
$$s_{ij} = x_i^\top B \, x_j$$
the kernel interpretation becomes clearer. Attention is not using one universal token similarity. Each head learns its own kernel-like compatibility:
$$\kappa_h(x_i, x_j) = x_i^\top B_h \, x_j, \qquad B_h = W_Q^{(h)\top} W_K^{(h)}$$
If $B_h$ were symmetric positive semidefinite, this would be a standard learned inner-product kernel after projection. With separate $W_Q$ and $W_K$, $B_h$ need not be symmetric or positive semidefinite. That is the price of directionality and the reason attention is a kernel smoother in the operational sense rather than always a Mercer kernel in the strict sense.
The softmax then turns this compatibility into contribution mass, so the full head reads:
$$\mathrm{head}_h(x_i) = \sum_j \mathrm{softmax}_j\big(\kappa_h(x_i, x_j)\big)\, v_j, \qquad v_j = W_V^{(h)} x_j$$
The bilinear form chooses the relation. The softmax normalizes the relation. The values carry the content.
Two details complete the picture. First, the score is divided by $\sqrt{d_k}$ before the softmax. That scaling is not cosmetic either: the entries of $QK^\top$ grow with the head dimension, and without the correction the scores would land in the softmax’s saturated region at initialization, where gradients vanish and every head collapses onto a single token. The $1/\sqrt{d_k}$ factor keeps the relation learnable. Second, the values have their own factorization, $W_O W_V$, the OV circuit, which is the dual of the QK circuit (Elhage et al., 2021). The division of labor is clean: the QK circuit decides where a token reads from, and the OV circuit decides what gets written when it does. This post is about the first; the second is the same bilinear story told about content instead of routing.
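Both details fit in a short sketch. The code below is an illustration under assumptions, not a reference implementation: an unmasked single head with random weights, where `attention_head`, `softmax`, and all sizes are chosen for the example. The loop measures how the spread of $q \cdot k$ grows with the head dimension when the entries have unit variance, and how dividing by $\sqrt{d_k}$ brings it back to about one.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, W_O):
    """One unmasked head: the QK circuit routes, the OV circuit writes."""
    d_k = W_Q.shape[0]
    scores = (X @ W_Q.T) @ (X @ W_K.T).T / np.sqrt(d_k)   # scaled logits
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    return weights @ X @ (W_O @ W_V).T, weights

rng = np.random.default_rng(0)

# Spread of q . k for unit-variance entries, before and after scaling.
spread = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(2000, d_k))
    k = rng.normal(size=(2000, d_k))
    raw = np.sum(q * k, axis=1)
    spread[d_k] = (raw.std(), (raw / np.sqrt(d_k)).std())
    print(d_k, spread[d_k])

n, d, d_head = 6, 32, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d_head, d)) / np.sqrt(d) for _ in range(3))
W_O = rng.normal(size=(d, d_head)) / np.sqrt(d_head)
out, weights = attention_head(X, W_Q, W_K, W_V, W_O)
```

The unscaled spread grows roughly like $\sqrt{d_k}$, which is what pushes an unscaled softmax toward one-hot rows as heads get wider.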
Why not just learn B directly?
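One could write $s_{ij} = x_i^\top B \, x_j$ and learn $B$ as a full $d \times d$ matrix. Transformers factor it instead:
$$B = W_Q^\top W_K, \qquad W_Q, W_K \in \mathbb{R}^{d_k \times d}$$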
This does three things at once.
First, it reduces parameters. A full $B$ costs $d^2$ parameters per head. The factorized form costs about $2\,d\,d_k$, and $d_k$ is much smaller than $d$.
Second, it reduces computation. The model can compute all queries and keys once, then form $QK^\top$. The factorization matches the matrix multiplication that hardware is good at.
Third, it makes the relation legible. Query features and key features are separable objects. In circuit terms, you can ask what directions write into $Q$, what directions write into $K$, and what pairings between them produce a head’s pattern.
A caveat keeps that legibility honest. Only the product $B = W_Q^\top W_K$ affects the scores; the split into $W_Q$ and $W_K$ is not unique. For any invertible $M \in \mathbb{R}^{d_k \times d_k}$, replacing $(W_Q, W_K)$ with $(M W_Q,\; M^{-\top} W_K)$ leaves every score unchanged: a gauge freedom in the factorization. So an individual query coordinate has no canonical meaning; what is identified is the relation $B$ and the query/key subspaces it pairs, not a privileged basis inside them. It is the same non-identifiability that attends any factored representation: the factors are a chosen coordinatization of a product that is the only observable object.
The factorization is not merely a trick for speed. It is the head’s relational vocabulary.
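The parameter count and the gauge freedom can both be worked through in a few lines. The sizes below ($d = 768$, $d_k = 64$) are an illustrative assumption, and `M` is an arbitrary invertible matrix built for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 768, 64                      # illustrative residual and head widths

full_params = d * d                   # learning B directly: 589824
factored_params = 2 * d * d_k         # learning W_Q and W_K: 98304
print(full_params, factored_params, full_params / factored_params)

W_Q = rng.normal(size=(d_k, d))
W_K = rng.normal(size=(d_k, d))

# Any invertible change of basis inside the head leaves B untouched.
# Adding d_k * I keeps this random M comfortably invertible.
M = rng.normal(size=(d_k, d_k)) + d_k * np.eye(d_k)
W_Q_new = M @ W_Q
W_K_new = np.linalg.inv(M).T @ W_K

same_relation = np.allclose(W_Q.T @ W_K, W_Q_new.T @ W_K_new)
same_queries = np.allclose(W_Q, W_Q_new)
print(same_relation, same_queries)
```

The individual factors change while the relation does not, which is why interpretation should target $B$ and its subspaces rather than single query or key coordinates.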
What Q and K buy
Without projections, attention says:
attend to tokens that are already similar to me.
With one shared projection, it says:
attend to tokens similar to me in this learned subspace.
With separate query and key projections, it says:
attend to tokens whose advertised features answer my requested features.
That is the whole reason $W_Q$ and $W_K$ exist. The dot product supplies a cheap compatibility operation. The projections decide what compatibility means.
The attention mechanism is therefore not “dot product similarity” in the ordinary sense. It is a learned bilinear relation, factorized into a query role and a key role, then normalized into a distribution over values.
What if we solved the bilinearity problem with a nonlinear kernel?
A bilinear form is the simplest object that can express a directed relation, and that simplicity is also its ceiling. The score $x_i^\top B \, x_j$ is linear in each argument separately. It can only score relations that decompose into pairwise feature products selected by $B$. Whatever compatibility a head needs that is not a sum of such products, the bilinear form cannot represent; it has to be approximated by stacking heads and layers around it.
The Q/K factorization makes this concrete: $B$ is at most rank $d_k$, and even at full rank it is still bilinear. We bought directionality by giving up symmetry and positive semidefiniteness, which is exactly why attention is a kernel smoother only in the operational sense and not a Mercer kernel in the strict one.
So the closing question is the obvious one, but it has a trap in it. The bilinear score is a stand-in for “how compatible are these two tokens?”, and that is precisely the question a kernel is built to answer. The tempting move is to replace it with a genuine nonlinear kernel $\kappa(x_i, x_j)$, positive definite by construction. But a Mercer kernel on the raw residual vectors is symmetric, $\kappa(x_i, x_j) = \kappa(x_j, x_i)$, so it throws away the very thing this whole post defended. You would gain a real geometry and a richer-than-bilinear comparison, and pay for it by collapsing attention back into symmetric similarity: no query role, no key role.
The sharper move keeps the roles and kernelizes the compatibility between them:
$$s_{ij} = \kappa(W_Q x_i,\; W_K x_j)$$
The query map still extracts what a token is asking for; the key map still extracts what it advertises. Only the comparison between them changes — no longer capped at a low-rank bilinear form, but a nonlinear, learnable, deliberately role-asymmetric compatibility. That is the real kernel version of attention: not a similarity kernel on tokens, but a compatibility kernel between roles. And it is the question these posts have been heading toward — can the QK relation be made nonlinear and richer while keeping the routing readable and cheap?
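One way to picture the difference, as a sketch only: take a Gaussian (RBF) kernel as a stand-in for the nonlinear compatibility. The choice of kernel, the bandwidths, and the helper `rbf_kernel` are assumptions made for illustration; the post does not commit to a particular kernel. Applied to raw tokens the kernel matrix is symmetric; applied between the query role and the key role it is not.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of `a` and the rows of `b`."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
d, d_k, n = 16, 4, 5
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d_k, d)) / np.sqrt(d)
W_K = rng.normal(size=(d_k, d)) / np.sqrt(d)

# Kernel on raw tokens: symmetric similarity, no roles.
token_kernel = rbf_kernel(X, X, sigma=np.sqrt(d))

# Kernel between roles: role_scores[i, j] = kappa(W_Q x_i, W_K x_j).
Q, K = X @ W_Q.T, X @ W_K.T
role_scores = rbf_kernel(Q, K)

print(np.allclose(token_kernel, token_kernel.T))
print(np.allclose(role_scores, role_scores.T))
```

The kernel itself is symmetric in its two arguments; the asymmetry comes entirely from feeding it two different learned views of the tokens.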
Cite as
Bouhsine, T. (2026). Why Attention Needs Q and K Projections. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/why-attention-needs-qk-projections/
BibTeX
@misc{bouhsine2026whyattentionneedsqkprojections,
author = {Bouhsine, Taha},
title = {Why Attention Needs Q and K Projections},
year = {2026},
month = {jun},
howpublished = {\url{https://tahabouhsine.com/blog/why-attention-needs-qk-projections/}},
note = {Blog post, Records of the !mmortal Data Scientist}
}
References
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- Tsai, Y.-H. H., et al. (2019). Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel. EMNLP-IJCNLP 2019.
- Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015. arXiv:1508.04025
- Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017. arXiv:1611.01734
- Olsson, C., et al. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread. arXiv:2209.11895
- Wang, K., et al. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. ICLR 2023. arXiv:2211.00593
- Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint. arXiv:2104.09864