The Readout is a Convex Combination of Prototypes
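Write the hidden activation vector of a transformer MLP as $h \in \mathbb{R}^m$ and write the columns of $W_{\text{out}}$ as output vectors $v_1, \dots, v_m \in \mathbb{R}^d$ in the residual stream. Then
$$W_{\text{out}}\, h = \sum_{j=1}^{m} h_j\, v_j.$$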
That is the readout. The MLP computes a population of feature responses, then reads them out by mixing a set of learned residual-stream prototypes.
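The identity can be checked on a toy example. This is a minimal sketch, assuming illustrative sizes (two residual dimensions, three hidden units) and made-up numbers; the helper names `matvec`, `column`, and `prototype_mix` are hypothetical and only show that the matrix product and the column mixture agree.

```python
# Toy sizes (assumed for illustration): d = 2 residual dims, m = 3 hidden units.
W_out = [[1.0, 0.0, -1.0],
         [0.0, 2.0, 1.0]]   # shape d x m; column j is the output prototype of unit j
h = [0.5, 0.25, 2.0]        # hidden activations, one coefficient per unit


def matvec(W, h):
    """Ordinary matrix-vector product W @ h."""
    return [sum(w * a for w, a in zip(row, h)) for row in W]


def column(W, j):
    """Column j of W: the vector written when unit j fires alone."""
    return [row[j] for row in W]


def prototype_mix(W, h):
    """Build the same output one prototype at a time: sum_j h_j * column_j."""
    out = [0.0] * len(W)
    for j, a in enumerate(h):
        out = [o + a * x for o, x in zip(out, column(W, j))]
    return out


y_matrix = matvec(W_out, h)
y_mix = prototype_mix(W_out, h)
print(y_matrix, y_mix)
```

Both routes compute the same vector; the second simply makes the per-unit messages visible.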
Two notes before we start. First, “convex” is the clean special case, not the default: a standard ReLU MLP is conic, and GELU or gated (SwiGLU) MLPs are signed. The title names the regime worth aiming for; much of the post is about when you do and do not get it. Second, throughout we classify the readout by its coefficients (not by the prototypes), treat the MLP output as an additive update to the residual stream, and ignore the bias unless stated — it only shifts the origin of the prototype hull.
The columns of W_out are output prototypes
The first linear map plus activation decides which hidden features are active. The second linear map decides what each active feature writes. Its $j$-th column is the vector written when hidden unit $j$ fires alone, that is, when $h$ is the standard basis vector $e_j$:
$$v_j = W_{\text{out}}\, e_j.$$
So the column $v_j$ is not an abstract weight. It is a one-neuron message to the residual stream: the direction, token-logit effect, downstream feature, or circuit ingredient that unit contributes when it turns on. The readout is the superposition of those messages.
This is the MLP analogue of attention’s value vectors. Attention mixes value prototypes indexed by source tokens. The MLP mixes output prototypes indexed by hidden units.
Convex, conic, affine, linear
The exact geometry depends on the coefficients, and it depends on exactly two yes/no questions: are the coefficients nonnegative, and do they sum to one? Those two constraints are independent, so together they carve the readout into four regimes, each with its own reachable set.
Convex ($h_j \ge 0$, $\sum_j h_j = 1$). Then
$$y = \sum_j h_j\, v_j, \qquad h_j \ge 0, \quad \sum_j h_j = 1$$
is a convex combination: a point inside the convex hull of the prototypes. This is the cleanest regime. The coefficients are a probability distribution over hidden units, so the output is an expectation, $y = \mathbb{E}_{j \sim h}[v_j]$: the layer chooses a categorical distribution over its messages and writes the mean. Every unit contributes a nonnegative share, and the shares add to one.
Conic ($h_j \ge 0$, sum free). Drop the normalization and
$$y = \sum_j h_j\, v_j = s \sum_j \tilde h_j\, v_j, \qquad s = \sum_j h_j, \quad \tilde h = h / s.$$
The direction is still a convex mixture; the coefficient mass $s$ is the write intensity, and the normalized $\tilde h$ chooses the mixture (when $s = 0$ the direction is undefined). The actual output norm depends on the prototype geometry too, not on $s$ alone. The reachable set is the conical hull: the cone the prototypes generate from the origin. This is the natural reading for a ReLU-style MLP: pick a point in the hull, then scale how loudly you write it into the residual stream.
Affine ($\sum_j h_j = 1$, signs free). Keep the sum-to-one but allow negative coefficients, and the output slides along the affine hull of the prototypes, the flat through them. It can leave the convex hull while staying on that flat; negative coefficients subtract prototypes while the total still normalizes.
Linear (no constraint). With signs and magnitudes both free, the readout is an arbitrary linear combination and can reach the entire linear span. This is the regime of GELU tails, gated and bilinear MLPs, and anything where the hidden state is neither a probability vector nor confined to a cone.
So the slogan needs a 2×2 footnote, not a one-line one:
The MLP readout is a convex combination when the coefficients are nonnegative and sum to one; conic when they are merely nonnegative; affine when they merely sum to one; and a linear expansion when they are unconstrained. Nonnegativity buys you the hull; summing to one pins you to the affine flat; both together give the simplex.
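The two yes/no questions translate directly into a small classifier. This is a sketch only: the function name `classify_readout`, the tolerance, and the example coefficient vectors are all assumptions made for illustration.

```python
def classify_readout(coeffs, tol=1e-9):
    """Name the regime of a coefficient vector from the two yes/no questions."""
    nonnegative = all(c >= -tol for c in coeffs)
    sums_to_one = abs(sum(coeffs) - 1.0) <= tol
    if nonnegative and sums_to_one:
        return "convex"   # inside the convex hull of the prototypes
    if nonnegative:
        return "conic"    # inside the cone the prototypes generate
    if sums_to_one:
        return "affine"   # on the flat through the prototypes
    return "linear"       # anywhere in their span


# Hypothetical coefficient vectors, one per regime.
examples = {
    "softmax-like": [0.7, 0.2, 0.1],
    "relu-like": [1.5, 0.0, 2.0],
    "signed, normalized": [1.5, -0.5, 0.0],
    "signed, free": [1.5, -2.0, 0.3],
}
for name, coeffs in examples.items():
    print(name, "->", classify_readout(coeffs))
```

The classifier looks only at the coefficients, never at the prototypes, which matches how the taxonomy is defined.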
That distinction matters because interpretability statements inherit it. “This unit contributed 30%” only means what it says in the convex or normalized-conic regime. In a signed regime, the correct statement is not contribution mass but vector cancellation.
And even in the convex regime there is a subtlety worth stating plainly. Because a transformer usually has far more hidden units than residual dimensions ($m \gg d$), the prototypes are an overcomplete dictionary, and most interior points of their hull admit many convex decompositions. Carathéodory’s theorem guarantees that any point in the hull has at least one decomposition using at most $d + 1$ prototypes, but it does not make that decomposition unique (uniqueness is the exception: an extreme point of the hull, for instance, can have only one). So when the choice is not forced, the coefficient vector the network produces is a convex decomposition, not the only one. “Unit $j$ contributed 30%” is a fact about the mechanism that generated the output $y$, not a fact recoverable from $y$ alone. Contribution mass is a property of the readout map, not of its output.
The figure below makes that literal: it holds the output fixed and continuously rewrites it as different convex combinations of the same prototypes. The point never moves; the “contributions” change completely.
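The same non-uniqueness can be reproduced numerically. A minimal sketch, assuming four hypothetical prototypes at the corners of the unit square in two dimensions (so the dictionary is overcomplete):

```python
# Four hypothetical prototypes in d = 2: one more than the d + 1 = 3 that
# Caratheodory's theorem says always suffice.
prototypes = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def mix(coeffs, points):
    """Weighted sum of points with the given coefficients."""
    dim = len(points[0])
    return tuple(sum(c * p[i] for c, p in zip(coeffs, points)) for i in range(dim))


# Three different convex coefficient vectors (nonnegative, summing to one).
decompositions = [
    [0.5, 0.0, 0.5, 0.0],      # one diagonal
    [0.0, 0.5, 0.0, 0.5],      # the other diagonal
    [0.25, 0.25, 0.25, 0.25],  # uniform over all four
]
outputs = [mix(coeffs, prototypes) for coeffs in decompositions]
print(outputs)
```

All three coefficient vectors land on the center of the square, yet the first gives prototype 0 a 50% share, the second gives it nothing, and the third gives it 25%.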
Why attention feels more legible
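The attention block forces the clean case:
$$\mathrm{Attn}(q) = \sum_t \alpha_t\, v_t, \qquad \alpha = \mathrm{softmax}\!\left(\frac{q K^\top}{\sqrt{d_k}}\right), \qquad \alpha_t \ge 0, \quad \sum_t \alpha_t = 1,$$
where $t$ indexes source tokens and $v_t$ is the value vector of token $t$.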
Attention has three convenient pieces:
- the values are named by source tokens,
- the weights are nonnegative,
- the weights sum to one.
The MLP readout has the first piece but not always the other two. Its prototypes are real: the columns of $W_{\text{out}}$. But the coefficients are whatever the hidden nonlinearity produces. If the architecture gives you a positive normalized activation over hidden units, the MLP readout becomes attention-like. If it gives you arbitrary signed features, the readout remains a linear expansion but loses the simple mass interpretation.
This reframes a familiar asymmetry. Attention is not interpretable because it has a matrix we can plot. It is interpretable because it presents the output as a convex mixture over named sources. The MLP can be read the same way only after we identify its sources: the hidden-unit output prototypes.
This convex reading of attention is not new: Tsai et al. (2019) showed that attention is a kernel smoother — the Nadaraya–Watson estimator with the exponential kernel as its similarity. The softmax denominator is exactly the kernel normalization. Seen that way, the question of this post is whether the MLP readout admits the same kernel-smoother reading — and under what conditions on its coefficients it does.
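The kernel-smoother reading is easy to verify for a single query. The sketch below assumes one attention head, the usual $1/\sqrt{d}$ scaling, and made-up keys and values; the function names are hypothetical. It computes the output once through softmax and once as a Nadaraya–Watson average with an exponential kernel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query softmax attention with 1/sqrt(d) scaling."""
    scale = math.sqrt(len(query))
    alphas = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(a * v[i] for a, v in zip(alphas, values)) for i in range(dim)]


def nadaraya_watson(query, keys, values, kernel):
    """Kernel-weighted average of the values, normalized by the kernel sum."""
    weights = [kernel(query, k) for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def exp_kernel(q, k):
    """Exponential kernel matching the scaled dot-product score."""
    return math.exp(dot(q, k) / math.sqrt(len(q)))


query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
y_attn = attention(query, keys, values)
y_nw = nadaraya_watson(query, keys, values, exp_kernel)
print(y_attn, y_nw)
```

The two outputs agree up to floating-point rounding, because the softmax denominator and the kernel normalization are the same sum.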
The MLP block, rewritten
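In standard notation the block is
$$\mathrm{MLP}(x) = W_{\text{out}}\, \sigma(W_{\text{in}} x + b_{\text{in}}) + b_{\text{out}}.$$
Rewrite it with columns exposed:
$$h = \sigma(W_{\text{in}} x + b_{\text{in}}), \qquad W_{\text{out}} = [\, v_1 \;\; v_2 \;\; \cdots \;\; v_m \,],$$
so
$$\mathrm{MLP}(x) = \sum_{j=1}^{m} h_j\, v_j + b_{\text{out}}.$$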
This is the whole mechanism. The hidden layer is a detector bank. The output matrix is a dictionary of things the detector bank can write. The MLP output is the detector-weighted mixture of that dictionary. This is not just a notational trick: Geva et al. (2021) found empirically that exactly this decomposition holds in trained language models. The rows of $W_{\text{in}}$ act as keys that fire on human-interpretable input patterns, and the columns of $W_{\text{out}}$ act as values that write predictable updates to the stream. The feed-forward layer is a key-value memory, and the readout is the composition of its memories.
Two bookkeeping points keep the geometry honest. First, the prototype sum is written into the residual stream as an additive update, not as the stream’s new state: $\mathrm{MLP}(x)$ is a delta the block contributes, in the spirit of the residual-stream view of Elhage et al. (2021). Second, the bias anchors the picture: the readout is affine, so the prototype hull is centered at $b_{\text{out}}$ rather than the origin. The convex/conic/affine/linear taxonomy describes where $\sum_j h_j v_j$ can land.
Once written this way, several interpretability operations become obvious.
Name the output prototypes. For each prototype $v_j$, ask what direction it writes into the residual stream. Does it point toward a token-logit direction? Does it align with another feature basis? Does it feed a known attention query or key direction in the next block? This is what the logit lens does, reading a residual-stream direction directly in vocabulary space (the tuned lens of Belrose et al. (2023) is its calibrated form), and it is exactly how Geva et al. (2022) interpret the $v_j$: each value vector promotes a human-readable set of concepts in the output distribution.
Measure the coefficient mass. For an input $x$, inspect the activations $h(x)$ or their normalized version. Which prototypes is the MLP mixing? Is the output concentrated on a few prototypes or spread across many?
Separate choice from intensity. The normalized vector $h / \lVert h \rVert_1$ says what mixture the layer chose. The norm $\lVert h \rVert_1$ says how loudly it wrote that mixture. These are different facts and should not be collapsed.
Watch cancellation. If the activation can be signed, inspect positive and negative mass separately. A large output may come from aligned positive prototypes, or from cancellation failing to cancel. A small output may mean no feature fired, or two strong features subtracted.
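For the signed case, the positive and negative parts can be reported side by side. This is a sketch under assumed toy values; `signed_readout` is a hypothetical helper, and the two nearly parallel prototypes are chosen so that two strong activations almost cancel.

```python
def signed_readout(h, prototypes):
    """Split a signed readout into its positive and negative contributions."""
    dim = len(prototypes[0])

    def mix(coeffs):
        return [sum(c * v[i] for c, v in zip(coeffs, prototypes)) for i in range(dim)]

    pos = [max(a, 0.0) for a in h]    # coefficients that add prototypes
    neg = [max(-a, 0.0) for a in h]   # magnitudes of coefficients that subtract them
    pos_part, neg_part = mix(pos), mix(neg)
    return {
        "positive_mass": sum(pos),
        "negative_mass": sum(neg),
        "positive_part": pos_part,
        "negative_part": neg_part,
        "output": [p - n for p, n in zip(pos_part, neg_part)],
    }


# Two nearly parallel hypothetical prototypes with opposite-sign activations.
prototypes = [[1.0, 0.0], [1.0, 0.1]]
report = signed_readout([2.0, -2.0], prototypes)
print(report)
```

Here both masses are large while the output is small: the case where a quiet readout hides two strong features subtracting, not an inactive layer.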
The prototype duality
There are really two prototype spaces in an MLP.
The rows of $W_{\text{in}}$ are input-side detectors. They decide what the unit listens for. In a kernel-shaped MLP, these become literal input prototypes: points or sections in the input geometry.
The columns of $W_{\text{out}}$ are output-side writers. They decide what the unit says when it fires. These are residual-stream prototypes.
A hidden unit is therefore a small rule:
when the input looks like this, write that.
The standard notation hides the rule because it names the matrices, not the columns. But the rule is there. $W_{\text{in}}$ supplies the condition; $W_{\text{out}}$ supplies the message.
This is the sense in which the MLP readout is an attention-like object. Attention says: “given this query, mix these token values.” The MLP says: “given this feature pattern, mix these hidden-unit values.” The index set changed from tokens to neurons, but the readout logic is the same.
What would make it truly convex?
A standard transformer MLP does not usually enforce a simplex over hidden units. But it could.
One can imagine an MLP block of the form
$$\mathrm{MLP}(x) = \sum_j \alpha_j(x)\, v_j, \qquad \alpha(x) = \mathrm{softmax}(s(x)),$$
where the hidden network computes scores $s(x)$ over output prototypes and the readout is explicitly a convex combination. This would make the MLP’s contribution mass as readable as attention’s. It would also impose a strong constraint: the block can only write inside the learned prototype hull unless another scalar gate controls intensity.
This is not a hypothetical architecture so much as a rediscovery of two known ones. A softmax over stored prototypes followed by a weighted readout is exactly one retrieval step of a modern Hopfield network (Ramsauer et al., 2020), whose update rule is the attention mechanism; its three regimes — converging to a single prototype, to a metastable average over a few, or to the global mean — are precisely the temperature regimes of the softmax. And if a sparse convex code is wanted instead of a dense one, sparsemax and entmax (Martins & Astudillo, 2016; Peters et al., 2019) replace the softmax with a projection onto the simplex that sets most coefficients to exactly zero — a readout that is convex and sparse, so only a handful of prototypes are ever mixed.
The temperature on that softmax is itself an interpretability dial. As it sharpens, the coefficient vector $\alpha$ approaches one-hot and the output $y$ snaps to the highest-scoring prototype: hard assignment, a vertex of the hull, vector quantization. As it flattens, $\alpha$ approaches uniform and $y$ relaxes to the centroid of the prototypes. The whole interior of the hull lives between those two limits, and the same dial appears as the bandwidth of the kernel readout below.
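The two limits of the dial can be seen on a toy example. A minimal sketch, assuming three hypothetical prototypes in two dimensions and fixed made-up scores; only the temperature changes between the two calls.

```python
import math


def softmax(scores, temperature):
    """Softmax of scores / temperature, stabilized by subtracting the max."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def convex_readout(scores, prototypes, temperature):
    """Mix the prototypes with softmax coefficients at the given temperature."""
    alphas = softmax(scores, temperature)
    dim = len(prototypes[0])
    return [sum(a * v[i] for a, v in zip(alphas, prototypes)) for i in range(dim)]


scores = [2.0, 1.0, 0.0]                           # prototype 0 scores highest
prototypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # centroid is (1/3, 1/3)

sharp = convex_readout(scores, prototypes, temperature=0.01)
flat = convex_readout(scores, prototypes, temperature=100.0)
print("sharp:", sharp)
print("flat: ", flat)
```

At low temperature the output sits essentially on the top-scoring prototype; at high temperature it sits near the centroid.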
A less restrictive version keeps the scale:
$$\mathrm{MLP}(x) = g(x) \sum_j \alpha_j(x)\, v_j, \qquad g(x) \ge 0, \quad \alpha(x) = \mathrm{softmax}(s(x)).$$
Now the MLP chooses what to write by convex mixture and how much to write by the gate $g(x)$. This is close to how one should mentally parse a positive unnormalized MLP anyway: direction as mixture, norm as intensity.
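A gated convex readout of this kind could look like the following. This is a sketch, not a proposed architecture: the softplus gate, the function names, and the toy prototypes are all assumptions; any nonnegative scalar gate would play the same role.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(z):
    """Smooth nonnegative gate, log(1 + e^z). An assumed choice of gate."""
    return math.log1p(math.exp(z))


def gated_convex_readout(scores, gate_logit, prototypes):
    """Choose a mixture with softmax, then scale it by a nonnegative gate."""
    alphas = softmax(scores)
    gate = softplus(gate_logit)
    dim = len(prototypes[0])
    mixture = [sum(a * v[i] for a, v in zip(alphas, prototypes)) for i in range(dim)]
    return alphas, gate, [gate * c for c in mixture]


prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, gate, y = gated_convex_readout([0.0, 0.0, 0.0], 3.0, prototypes)
print(alphas, gate, y)
```

With equal scores the mixture is the centroid (2/3, 2/3); the gate of roughly 3.05 pushes the output past every prototype's coordinates, outside the hull but still inside the cone.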
The gate also tells you what convexity costs. A pure convex readout is bounded: it can only reach the convex hull of the prototypes $\{v_j\}$. Adding the nonnegative gate extends the reach to the conical hull, and a cone can cover all of $\mathbb{R}^d$, but only if the prototypes positively span the space; otherwise the output stays confined to the cone they generate. A positive spanning set of $\mathbb{R}^d$ needs at least $d + 1$ vectors in general position. So the interpretability of nonnegative coefficients is not free: it either restricts the block’s output cone or forces redundancy in the prototype dictionary. Signed coefficients buy the full span back unconditionally, at the price of the clean mass interpretation. That is the whole trade in one sentence.
The design question is whether that constraint is worth the interpretability. Attention already made that bargain: per head, its output is a convex mix of value vectors before the output projection (and, since that projection is linear, of the projected values too, though multi-head concatenation and $W_O$ then mix heads, so the whole block is not literally one token-simplex mixture). The MLP historically did not; it bought flexibility by allowing arbitrary signed linear readout. The cost is that its contribution semantics are weaker.
The point
The output projection in an MLP is not a boring shape-fixing layer. It is the dictionary of messages the hidden layer can write to the residual stream.
If the hidden activations are normalized and nonnegative, the MLP readout is literally a convex combination of those messages. If they are merely nonnegative, it is a scaled convex combination. If they are signed, it is a signed expansion over the same prototypes.
That is the useful reframing:
An MLP block is a detector bank followed by a prototype readout. Attention mixes token-indexed prototypes. The MLP mixes neuron-indexed prototypes. The difference is not the existence of a readout; it is whether the coefficients form a clean probability distribution.
Once you see $W_{\text{out}}$ this way, the natural unit of analysis is not a neuron activation by itself. It is the pair: the detector that made the coefficient, and the output prototype that coefficient writes.
What if the coefficients came from a kernel?
Everything above treats the coefficients as whatever the hidden nonlinearity happens to produce, and then asks, after the fact, whether they are nonnegative, normalized, or signed. The interpretability we get is the interpretability the activation function chose to leave us. A ReLU bank gives a conic readout; a GELU bank gives a signed one; only a softmax over hidden units would hand us a clean convex combination — and standard MLPs do not do that.
You can watch this directly. Fix the hidden scores and change only the nonlinearity applied to them. The same scores, read out through four different activations, land in three different geometric regimes — and only one of them keeps the readout inside the prototype hull.
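The same experiment can be run in a few lines. This sketch assumes a particular set of four activations (ReLU, exact GELU, SiLU, and softmax) and made-up scores; the original choice of activations is not recoverable from the text, so treat these as stand-ins.

```python
import math


def relu(x):
    return max(x, 0.0)


def gelu(x):
    """Exact GELU: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regime(coeffs, tol=1e-9):
    """Classify coefficients by nonnegativity and sum-to-one."""
    nonnegative = all(c >= -tol for c in coeffs)
    sums_to_one = abs(sum(coeffs) - 1.0) <= tol
    if nonnegative:
        return "convex" if sums_to_one else "conic"
    return "affine" if sums_to_one else "linear"


scores = [1.0, -0.5, 2.0, -3.0]  # fixed hidden scores (assumed values)
coefficients = {
    "relu": [relu(s) for s in scores],
    "gelu": [gelu(s) for s in scores],
    "silu": [silu(s) for s in scores],
    "softmax": softmax(scores),
}
regimes = {name: regime(c) for name, c in coefficients.items()}
print(regimes)
```

ReLU lands in the conic regime, GELU and SiLU in the linear one because their negative tails produce negative coefficients, and only softmax yields a convex readout.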
Modern transformers, if anything, lean the wrong way. Gated MLPs (GLU, SwiGLU) form their hidden activations as an elementwise product of two linear projections, $h = \sigma(W_g x) \odot (W_u x)$, so the coefficients are products of signed quantities and live squarely in the linear regime. The trend in architectures has been toward more expressive, less convex readouts, which makes the question of how to read them only more pressing.
But there is a more direct way to make the readout genuinely convex, and it is the same move that makes attention legible in the first place. Instead of computing scores with a linear map and hoping the activation behaves, compute the coefficients with a kernel $k$ against the input-side prototypes $u_j$:
$$\alpha_j(x) = \frac{k(x, u_j)}{\sum_{l} k(x, u_l)}, \qquad y = \sum_j \alpha_j(x)\, v_j.$$
Two things have to be true for this to be a convex readout, and they are not the same thing. The sum-to-one is automatic: it is just the normalization, true for any kernel wherever the denominator is nonzero. Nonnegativity of the coefficients is the real requirement, and it asks that the kernel be pointwise nonnegative, $k(x, u) \ge 0$, not that it be positive definite. These are independent properties, and it is worth being careful here because they are easy to conflate. The linear kernel $k(x, u) = x^\top u$ is positive definite yet takes negative values; the boxcar kernel $k(x, u) = \mathbb{1}[\lVert x - u \rVert \le r]$ is nonnegative yet not positive definite. Positive-definiteness is what makes $k$ a Mercer similarity with an RKHS (the property that ties this construction back to “attention is a kernel”), but it is nonnegativity, not positive-definiteness, that makes the readout convex.
The kernels worth reaching for satisfy both. The exponential kernel behind softmax attention, the Gaussian, and the Yat kernel are each nonnegative and positive definite. The Yat kernel, introduced by Bouhsine (2026), which proves it positive definite and universal under the parameter conditions stated there, earns the PSD property as a Schur product of a squared-linear numerator (PSD, and manifestly $\ge 0$) with an inverse-multiquadric factor (PSD). That intersection is the corner you want: a similarity that is simultaneously a legitimate Mercer kernel and a source of convex weights.
With such a kernel, the readout is exactly a convex combination of the output prototypes — a Nadaraya–Watson estimator over a learned prototype set. The detector bank stops being an arbitrary feature extractor and becomes a similarity to named input prototypes; the readout stops being a hopeful interpretation and becomes the actual mechanism.
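A kernel readout of this shape fits in a few lines. The sketch below assumes a Gaussian kernel with unit bandwidth and three hypothetical prototype pairs; `kernel_readout` is an illustrative name, not an established API.

```python
import math


def gaussian_kernel(x, u, bandwidth=1.0):
    """Pointwise nonnegative kernel: exp(-||x - u||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, u))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_readout(x, input_prototypes, output_prototypes, kernel):
    """Nadaraya-Watson readout: normalized similarities mix the output prototypes."""
    sims = [kernel(x, u) for u in input_prototypes]
    total = sum(sims)
    if total == 0.0:
        raise ValueError("all similarities are zero; the readout is undefined")
    coeffs = [s / total for s in sims]
    dim = len(output_prototypes[0])
    y = [sum(c * v[i] for c, v in zip(coeffs, output_prototypes)) for i in range(dim)]
    return coeffs, y


# Each hidden unit pairs an input-side prototype with an output-side one.
input_prototypes = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
output_prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

coeffs, y = kernel_readout([0.0, 0.0], input_prototypes, output_prototypes, gaussian_kernel)
print(coeffs, y)
```

Because the kernel is nonnegative and the coefficients are normalized, the output cannot leave the hull of the output prototypes; a query sitting on the first input prototype writes almost exactly the first output prototype.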
One caveat the normalization hides: where every similarity $k(x, u_j)$ is tiny (a query far from all prototypes, for a fast-decaying kernel like the Gaussian), the denominator collapses and the normalized weights become ill-conditioned, the familiar extrapolation pathology of Nadaraya–Watson. A heavier-tailed kernel like Yat or the inverse-multiquadric, whose weights fall off only polynomially, normalizes far more gracefully off-support. Convexity holds wherever the readout is defined; stability off the prototype set is a separate, kernel-dependent matter, and another reason the Yat kernel is a natural choice here.
This is the whole construction in one picture. Every hidden unit pairs an input-side prototype $u_j$ with an output-side prototype $v_j$. The query $x$ lives in the input space on the left; its kernel similarities to the $u_j$ become the coefficients; those same coefficients mix the $v_j$ in the residual stream on the right. Drag $x$ anywhere and the readout can never escape the output hull: convexity is not something you check afterward, it is built into how the coefficients are formed.
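The collapse is visible in floating point. This sketch compares the normalization sum for a Gaussian kernel and an inverse-multiquadric kernel at a query far from three hypothetical prototypes; the bandwidth and the constant `c` are assumed values, and the Yat kernel itself is not implemented here.

```python
import math


def sq_dist(x, u):
    return sum((a - b) ** 2 for a, b in zip(x, u))


def gaussian(x, u, bandwidth=1.0):
    """Exponentially decaying kernel."""
    return math.exp(-sq_dist(x, u) / (2.0 * bandwidth ** 2))


def inverse_multiquadric(x, u, c=1.0):
    """Polynomially decaying kernel: 1 / sqrt(c^2 + ||x - u||^2)."""
    return 1.0 / math.sqrt(c * c + sq_dist(x, u))


def denominator(x, prototypes, kernel):
    """The normalization sum that the coefficients are divided by."""
    return sum(kernel(x, u) for u in prototypes)


input_prototypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
far_query = [40.0, 40.0]

gauss_den = denominator(far_query, input_prototypes, gaussian)
imq_den = denominator(far_query, input_prototypes, inverse_multiquadric)
print("gaussian denominator:", gauss_den)
print("inverse-multiquadric denominator:", imq_den)
```

At this distance the Gaussian similarities underflow to zero in double precision, so the normalized readout is undefined, while the heavier-tailed kernel still gives a usable denominator.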
That is the question these posts keep circling. Attention already pays for convexity with its softmax. The MLP could pay for it the same way — with a kernel over prototypes instead of a linear map into an activation — and get a readout that is convex by construction rather than by accident.
Cite as
Bouhsine, T. (2026). The Readout is a Convex Combination of Prototypes. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/readout-as-convex-combination/
BibTeX
@misc{bouhsine2026readoutasconvexcombination,
author = {Bouhsine, Taha},
title = {The Readout is a Convex Combination of Prototypes},
year = {2026},
month = {jun},
howpublished = {\url{https://tahabouhsine.com/blog/readout-as-convex-combination/}},
note = {Blog post, Records of the !mmortal Data Scientist}
}
References
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability & Its Applications 9(1), 141–142.
- Watson, G. S. (1964). Smooth Regression Analysis. Sankhyā: The Indian Journal of Statistics, Series A 26(4), 359–372.
- Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Geva, M., et al. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021. arXiv:2012.14913
- Geva, M., et al. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. EMNLP 2022. arXiv:2203.14680
- Tsai, Y.-H. H., et al. (2019). Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel. EMNLP-IJCNLP 2019. arXiv:1908.11775
- Ramsauer, H., et al. (2020). Hopfield Networks is All You Need. ICLR 2021. arXiv:2008.02217
- Martins, A. F. T., & Astudillo, R. F. (2016). From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. ICML 2016. arXiv:1602.02068
- Peters, B., Niculae, V., & Martins, A. F. T. (2019). Sparse Sequence-to-Sequence Models. ACL 2019. arXiv:1905.05702
- Belrose, N., et al. (2023). Eliciting Latent Predictions from Transformers with the Tuned Lens. arXiv preprint. arXiv:2303.08112
- Bouhsine, T. (2026). A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance. arXiv preprint. arXiv:2605.03262