What an MLP Knows, When It's a Kernel
A transformer block has two layers, and one of them is read in a way the other is not. Attention heads have names — induction, name-mover, positional, copying — drawn from the algorithmic behaviour they implement, pointed at in code, ablated to confirm a single function disappears. The position-wise MLP, sitting in the same residual stream and consuming a comparable parameter budget, almost never affords this kind of reading. The standard interpretability move for the MLP is to give up on its native representation and project into something else — fit a sparse autoencoder, train a probe, look at activation magnitudes — then try to map back to the layer’s units. The MLP does not help.
It does not help for a structural reason, not a difficulty of depth or scale. Attention is a kernel machine — I made the argument at length in Attention is Explainable Because it is a Kernel — and the four objects it supplies (a real-valued pairwise score on every input/unit pair, a normalised contribution mass, a geometry on the inputs, and a basis of meaningful per-unit directions) are exactly the ones interpretability tools spend years rebuilding for layers that lack them. The standard MLP layer is not a kernel machine. Its primitive is an affine map followed by a pointwise nonlinearity, and pointwise nonlinearities are bad for geometry: the same diagonal modulation that buys selectivity destroys the geometry the linear part might have carried, leaving the layer’s output without a native similarity score, without a normalised contribution, without a distance.
This post asks the constructive question. What does an MLP look like when its primitive is a kernel? Which of attention’s four objects show up, and what do they let you do?
What’s actually inside the block
Before the structural argument, here are three step-through diagrams of what the layers in question do, useful even if the rest of the post is review for you.
A standard transformer block has two sub-layers wired around a residual stream:
The attention sub-block is a kernel machine in plain sight: a query–key inner product makes a similarity matrix, a row-wise softmax turns it into a contribution distribution, and the result is applied to values. (For the longer reading — RKHS, Mercer, why softmax-normalised similarity is the Nadaraya–Watson smoother — see Attention is Explainable Because it is a Kernel.)
The MLP sub-block is the other side. It is an affine map, a pointwise activation, and another affine map, with nothing in between that names which inputs the layer is responding to.
Three diagrams, two structurally different layers. Attention has a kernel; the MLP doesn’t. The rest of the post is about what changes if we put one in.
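The contrast between the two sub-layers can be sketched in a few lines of NumPy. This is a minimal single-head sketch with no masking; the toy sizes, the random weights, and the function names `attention` and `mlp` are assumptions for illustration, not the post's code.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Query-key inner products give a (tokens x tokens) similarity matrix.
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    # Row-wise softmax turns each row into a contribution distribution.
    weights = softmax(scores, axis=-1)
    return weights @ (X @ Wv), weights

def mlp(X, W1, b1, W2, b2):
    # Affine map, pointwise ReLU, affine map: no pairwise score is exposed.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
T, d, h = 5, 8, 16                       # hypothetical toy sizes
X = rng.normal(size=(T, d))
out_attn, weights = attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
out_mlp = mlp(X, rng.normal(size=(d, h)), np.zeros(h),
              rng.normal(size=(h, d)), np.zeros(d))
```

The attention function hands back `weights`, a matrix whose rows sum to one and can be read directly. The MLP function has no comparable intermediate object to return.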
What a kernel layer carries
The shortest argument for what a kernel-shaped unit gives you is to put one next to the standard ReLU unit and show them side by side. The ReLU unit’s response is a half-plane: above the affine hyperplane it is positive and grows without bound; below the hyperplane it is zero. There is no notion of a peak in input space, no symmetric “this is the input the unit is looking for” — the unit carries a direction, not a point. (This is the same distinction Opposite Is Not Different draws when it argues that cosine similarity has three landmarks: the unit’s direction on the sphere is only ever maximally different from its negation, not from an orthogonal alternative.) The kernel unit, by contrast, is a localised bump centred at a learnable point.
standard MLP unit · max(0, w · x + b)
kernel unit · (x · W)² / (‖x − W‖² + ε)
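To make the two panel formulas concrete, the sketch below evaluates both units at points along the ray through the weight vector. The lowercase `w` plays the role of `W` in the kernel panel; the specific vector, the value of `eps`, and the sample points are assumptions chosen for illustration.

```python
import numpy as np

def relu_unit(x, w, b):
    # max(0, w . x + b): positive on one side of the hyperplane, zero on the other.
    return np.maximum(0.0, x @ w + b)

def kernel_unit(x, w, eps=1e-2):
    # (x . w)^2 / (||x - w||^2 + eps), following the panel formula.
    return (x @ w) ** 2 / (np.sum((x - w) ** 2, axis=-1) + eps)

w = np.array([1.0, 0.5])                 # hypothetical weight vector
ts = np.array([0.5, 1.0, 2.0, 10.0, 100.0])
ray = ts[:, None] * w                    # points t * w along the unit's direction

relu_vals = relu_unit(ray, w, b=0.0)     # keeps growing with t
kern_vals = kernel_unit(ray, w)          # largest at the sample t = 1, i.e. x = w
```

Along this ray the ReLU response grows linearly with `t`, while the kernel response is largest at the sample where `x` equals `w` and settles toward `w @ w` far away. That is the difference between carrying a direction and carrying a point.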
Four properties fall out of the kernel type, in direct parallel to attention.
A pairwise score on every input/unit pair. $k(x, w)$ is a real number that says how strongly $x$ matches the unit’s centre. The unit’s activation is literally that score; this is what the layer computes. The ReLU unit’s output is a scalar function of one direction, with no symmetric notion of similarity between the input and any other point.
A learnable centre in input space. $w$ is the point the unit is looking for. Because the response is highest when $x$ matches $w$, the weight vector is, formally, a soft prototype, interpretable in the same vocabulary as the network’s inputs. The standard MLP weight is a direction; the kernel unit’s weight is a point.
A normalised contribution mass. If the layer’s downstream consumer normalises the activations (by softmax, or by anything else that converts non-negative scores to a partition of unity), the result is a contribution distribution over prototypes. Statements of the form “$p$% of this output came from unit $i$” are first-class.
A geometry on the inputs. The kernel induces a metric on input space: two inputs are close iff their unit-score profiles are close. For a Mercer kernel the metric is genuinely Riemannian; the layer pulls the network’s downstream notion of “near” back to a metric on the input space that the network respects.
These are the four objects practitioners pick up when they read an attention head and put down when they try to read a position-wise MLP.
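The four objects can be read off a small score matrix. The sketch below uses the kernel-unit formula from the panel above with hand-picked prototypes and inputs, all of which are illustrative assumptions. The sum-to-one normalisation is one simple choice of partition of unity, not necessarily the one a trained network would use.

```python
import numpy as np

def kernel_scores(X, W, eps=1e-2):
    # Pairwise score for every (input, unit) pair: shape (n_inputs, n_units).
    dots = X @ W.T
    sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return dots ** 2 / (sq_dists + eps)

# Four hypothetical prototypes in R^3, one per row.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
# Two inputs: one near prototype 2, one near prototype 3.
X = np.array([[0.02, -0.01, 1.01],
              [0.90, 1.10, 1.00]])

S = kernel_scores(X, W)                       # 1. pairwise scores
best = S.argmax(axis=1)                       # 2. the centre each input matches
P = S / S.sum(axis=1, keepdims=True)          # 3. normalised contribution mass
profile_dist = np.linalg.norm(S[0] - S[1])    # 4. distance between score profiles
```

Each row of `P` is a distribution over prototypes, so "almost all of this input's mass sits on unit 2" is a statement about the layer's own numbers.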
The two parameters of a kernel unit
Each unit in a kernel layer has two parameters, and they play different roles. The first is the prototype $w_i$. It is a point in input space, formally the location of a kernel section $k(\cdot, w_i)$, which is a single function in the RKHS associated to the kernel $k$. The unit’s response to an input $x$ is a measurement against that section: how close, in the kernel’s geometry, is $x$ to $w_i$. Selecting a neuron, in this layer, means evaluating $k(x, w_i)$: the prototype’s RKHS section sampled at the input.
The second parameter is the readout coefficient $\alpha_i$. It is not a similarity score. It is the weight the layer places on the prototype’s contribution to the layer’s output. Positive $\alpha_i$ means “this prototype pushes the output up when its kernel fires”; negative $\alpha_i$ means “this prototype pushes the output down.” Magnitude is how loudly. With $n$ units, the layer computes a finite kernel expansion in the RKHS,

$$f(x) = \sum_{i=1}^{n} \alpha_i \, k(x, w_i).$$

This is the same finite kernel expansion classical SVMs and Gaussian processes use; the difference is that here the centres are learned end-to-end rather than fixed to the training data, and the $\alpha_i$ are the readout coefficients that turn the population of prototype responses into a layer output.
The picture above is the structural content of an entire MLP block compressed to one page. Each prototype is a function in the RKHS — its bump in the heatmap. The readout coefficients pick out which bumps add positively and which subtract. The input activates the units in proportion to how close it is to each prototype. The layer’s output is the signed sum of those activations.
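The signed sum described above fits in a few lines. In this sketch the prototypes are written `W` (one per row) and the readouts `alpha`, both notational assumptions. A Gaussian kernel is swapped in as a familiar stand-in, since the expansion itself does not depend on which kernel is used.

```python
import numpy as np

def gaussian_kernel(x, W, gamma=1.0):
    # k(x, w_i) = exp(-gamma * ||x - w_i||^2), evaluated for every row w_i of W.
    return np.exp(-gamma * np.sum((x - W) ** 2, axis=-1))

def kernel_layer(x, W, alpha, kernel=gaussian_kernel):
    # Finite kernel expansion: f(x) = sum_i alpha_i * k(x, w_i).
    scores = kernel(x, W)                # one score per prototype
    return float(alpha @ scores), scores

W = np.array([[0.0, 0.0], [2.0, 0.0]])   # two hand-picked prototypes
alpha = np.array([1.0, -1.0])            # first pushes up, second pushes down

f_near_first, _ = kernel_layer(np.array([0.1, 0.0]), W, alpha)
f_near_second, _ = kernel_layer(np.array([1.9, 0.0]), W, alpha)
```

An input near the first prototype yields a positive output and an input near the second a negative one. The sign comes from the readout, the magnitude from the kernel score.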
One instantiation: the Yat unit
To make the kernel concrete, fix a particular choice of $k$. The Yat unit on $\mathbb{R}^d$ uses

$$y(x) = \alpha \, \frac{(w \cdot x + b)^2}{\lVert x - w \rVert^2 + \varepsilon},$$

with prototype $w$, bias $b$, regulariser $\varepsilon$, and readout $\alpha$. The fraction is the kernel similarity $k(x, w)$: its denominator is a regularised squared Euclidean distance from $x$ to $w$, minimised at $x = w$; its numerator is a squared inner product. The unit’s score is high when $x$ is both close to $w$ and aligned with it. The readout $\alpha$ multiplies that score to give the unit’s actual contribution to the layer.
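A minimal sketch of a single Yat unit, following the description above: a squared inner product (with bias) over a regularised squared distance, scaled by the readout. The argument names and default values are assumptions for illustration.

```python
import numpy as np

def yat_unit(x, w, b=0.0, eps=1e-3, alpha=1.0):
    # alpha * (w . x + b)^2 / (||x - w||^2 + eps)
    score = (x @ w + b) ** 2 / (np.sum((x - w) ** 2, axis=-1) + eps)
    return alpha * score

w = np.array([1.0, 1.0])                           # hypothetical prototype
at_prototype = yat_unit(w, w)                      # close and aligned
far_aligned = yat_unit(5.0 * w, w)                 # aligned but far away
orthogonal = yat_unit(np.array([1.0, -1.0]), w)    # same norm, zero inner product
pushed_down = yat_unit(w, w, alpha=-1.0)           # negative readout flips the sign
```

The three probe points separate the two ingredients: distance alone suppresses `far_aligned`, misalignment alone zeroes `orthogonal`, and only `at_prototype` scores high on both.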
The Yat kernel is one choice in a family. Any positive-definite $k$ furnishes the four objects above; the geometry it induces, and therefore what the layer treats as “near” and “far”, depends on the particular kernel. The next viz puts four common kernels side by side with the same prototype $w$ and the same input $x$, so the choice’s signature is visible.
[Figure: four kernels, one panel each: Gaussian, Laplace, Yat, polynomial.]
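The same comparison can be sketched numerically by evaluating each kernel at one input near the prototype and one far from it. The parameter values (`gamma`, `eps`, `degree`, `c`) and the probe points are assumptions. The Laplace kernel here uses the L1 distance, which is one common convention.

```python
import numpy as np

def gaussian(x, w, gamma=1.0):
    return np.exp(-gamma * np.sum((x - w) ** 2))

def laplace(x, w, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - w)))

def yat(x, w, eps=1e-2):
    return (x @ w) ** 2 / (np.sum((x - w) ** 2) + eps)

def polynomial(x, w, degree=2, c=1.0):
    return (x @ w + c) ** degree

kernels = {"Gaussian": gaussian, "Laplace": laplace,
           "Yat": yat, "polynomial": polynomial}

w = np.array([1.0, 0.0])                 # shared prototype
near_x = np.array([1.1, 0.0])            # input close to the prototype
far_x = np.array([4.0, 0.0])             # input far away, same direction

results = {name: (float(k(near_x, w)), float(k(far_x, w)))
           for name, k in kernels.items()}
for name, (near, far) in results.items():
    print(f"{name:10s} near={near:9.4f} far={far:9.4f}")
```

With these settings the Gaussian, Laplace, and Yat kernels all score the near input above the far one. The polynomial kernel does the opposite, because it rewards a large inner product regardless of distance.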
A small architecture that uses it
The interesting question is what the four affordances buy when a whole network depends on them. Build a network with three inputs (an image of a first digit, an image of an operator, an image of a second digit), a shared CNN encoder that maps each input to an embedding, a single Yat layer mapping the concatenated embedding vector to 256 unit activations, and a small ConvTranspose decoder that paints the answer as an image with three slots: sign, tens, units.
The single-layer trunk is the point. With one Yat layer rather than a stack, every row of its weight matrix is a named prototype in the encoder’s embedding space, and the network’s entire mid-stream representation is the matrix $W$ and the 256 unit activations it produces.
Each row $w_i$ partitions naturally into three slots matching the three input embeddings,

$$w_i = \big[\, w_i^{(a)} \;\; w_i^{(\mathrm{op})} \;\; w_i^{(b)} \,\big],$$

so the unit’s prototype factorises by input role: a centre for the first digit, a centre for the operator image, a centre for the second digit. The four operations below all act on this matrix.
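The slot structure is plain array slicing. In the sketch below the embedding width `d`, the random stand-in for the trained matrix, and the slot names are all assumptions; only the 256 units and the order (first digit, operator, second digit) come from the text.

```python
import numpy as np

d = 16                                    # hypothetical per-input embedding width
n_units = 256                             # trunk width stated in the text
rng = np.random.default_rng(0)
W = rng.normal(size=(n_units, 3 * d))     # random stand-in for the trained matrix

# Assumed slot order: first digit, operator, second digit.
slots = {
    "digit_a": slice(0, d),
    "operator": slice(d, 2 * d),
    "digit_b": slice(2 * d, 3 * d),
}

x = rng.normal(size=3 * d)                # stand-in concatenated embedding
w = W[0]                                  # one unit's prototype

# Inner product and squared distance both decompose as sums over slots.
dot_by_slot = {name: float(x[s] @ w[s]) for name, s in slots.items()}
dist_by_slot = {name: float(np.sum((x[s] - w[s]) ** 2))
                for name, s in slots.items()}

dot_total = float(x @ w)
dist_total = float(np.sum((x - w) ** 2))
```

Because both ingredients of the Yat score add up slot by slot, each input role's share of a unit's inner product and distance can be inspected separately. That is what makes slot-level edits meaningful.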
Four operations that follow
Naming. For each unit, find the library symbol whose encoder embedding is nearest to each slot of . The unit’s role is “fire on inputs that look like this triple.” No SAE, no probe; the weights of the layer are the dictionary. Of the trunk units in the trained network, have an operator symbol as the maximiser of their middle-slot prototype .
Visualisation. Push the one-hot activation $e_i$ through the decoder. The result is the per-unit footprint image: the pixels unit $i$ alone paints into the output, with everything else off. Most footprints are spatially localised to a single output slot; the model has carved itself into a slot alphabet whose neurons are the trunk’s prototypes pushed forward by the decoder.
Ablation by name. Identify the units whose prototype matches a category in the named vocabulary. Zero those rows. By the layer’s kernel structure the rest of the units still fire on their own prototypes; only the targeted ones go silent. Specificity is a property of the prototype, not of a probe trained to find it.
Slot-level surgery. Where the unit’s input partitions into named subspaces, the prototype partitions with it. Zeroing only the operator-slot subspace of a tagged unit silences the unit’s reading of the operator image while leaving the digit pathways intact. This is the difference between “the unit contributes to behaviour $B$” and “the unit contributes to behaviour $B$ specifically through input role $r$”, a mechanistic claim activation steering on a black-box MLP cannot make.
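The four operations above can each be sketched as a short array edit. Everything in this block is a toy stand-in: the symbol library, the random embeddings and weights, the linear `decoder`, and the planted prototype are hypothetical and exist only so each step has something to act on.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_units = 8, 6                                   # hypothetical toy sizes
symbols = ["0", "1", "+", "×"]                      # hypothetical symbol library
library = rng.normal(size=(len(symbols), d))        # stand-in encoder embeddings
W = rng.normal(size=(n_units, 3 * d))               # stand-in trunk weights
op = slice(d, 2 * d)                                # operator (middle) slot

def name_slot(slot_weights, library, symbols):
    # Nearest library embedding (Euclidean) to each unit's slot prototype.
    d2 = ((slot_weights[:, None, :] - library[None, :, :]) ** 2).sum(axis=-1)
    return [symbols[j] for j in d2.argmin(axis=1)]

# Plant a known prototype so the naming step has something to find.
W[3, op] = library[symbols.index("×")]

op_names = name_slot(W[:, op], library, symbols)    # 1. naming
tagged = [i for i, s in enumerate(op_names) if s == "×"]

decoder = rng.normal(size=(n_units, 12))            # hypothetical linear decoder
footprint = np.eye(n_units)[3] @ decoder            # 2. one-hot through decoder

W_row_ablated = W.copy()
W_row_ablated[tagged] = 0.0                         # 3. ablation by name

W_slot_ablated = W.copy()
W_slot_ablated[tagged, op] = 0.0                    # 4. slot-level surgery
```

With a Yat unit and zero bias, a zeroed row makes the squared inner product vanish, which is one way to see why the targeted units go silent. Zeroing only the operator slot leaves the digit-slot entries of the same rows untouched.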
The interventions, live
The ablation operations are the ones that turn descriptive interpretability into a falsifiable causal claim. The widget below applies each of three interventions — zero the entire row of a tagged unit, zero only its operator slot, or zero a random subset of the same size — to the trained trunk, and reports the per-operator OCR change.
The ×-tagged subset is the clearest case. Zeroing the rows whose middle-slot prototype is the glyph drops multiplication OCR from to — a fall of percentage points — while division loses only pp. Zeroing only the middle slot of those same rows reproduces pp of that drop, while collateral damage on every other operator stays below pp. The multiplication computation does not flow through these units’ digit pathways or through a distributed residual code; it flows specifically through the weights of their operator-slot prototypes.
A random subset of units, ablated the same way, drops multiplication by only pp. The targeted intervention is roughly four times as specific as random ablation of the same size.
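The bookkeeping behind a per-operator comparison like this one can be sketched as follows. The labels and correctness flags below are made up purely to exercise the arithmetic and are not the post's results. The function names and the idea of scoring each example as correct or incorrect are assumptions about how such a report might be computed.

```python
import numpy as np

def per_operator_accuracy(ops, correct):
    # Accuracy in percent, grouped by operator label.
    return {str(o): 100.0 * correct[ops == o].mean() for o in np.unique(ops)}

def change_in_pp(ops, correct_before, correct_after):
    # Percentage-point change per operator after an intervention.
    before = per_operator_accuracy(ops, correct_before)
    after = per_operator_accuracy(ops, correct_after)
    return {o: after[o] - before[o] for o in before}

# Made-up labels and outcomes, only to exercise the bookkeeping.
ops = np.array(["×", "×", "×", "×", "+", "+", "+", "+"])
baseline = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=bool)
after_targeted = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=bool)
after_random = np.array([1, 1, 1, 0, 1, 1, 1, 0], dtype=bool)

drop_targeted = change_in_pp(ops, baseline, after_targeted)
drop_random = change_in_pp(ops, baseline, after_random)

# Ratio of the targeted drop to the random drop on the same operator.
specificity = drop_targeted["×"] / drop_random["×"]
```

A targeted intervention is specific when its drop concentrates on one operator and the ratio against a same-size random ablation is well above one.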
None of this required a sparse autoencoder, a probe, or an external dictionary. The model’s “knowledge of multiplication” is a subset of rows in a 256-row matrix, named directly off the kernel structure of the layer, and the prototype labels survive surgical editing of the matrix.
Be the optimiser
The point of the construction is that the layer’s parameters are legible enough that a human can place them by hand. The widget below puts you in the place of gradient descent: pick a classic 2D dataset, click anywhere on the chart to drop a prototype $w_i$, drag it to a position you think matters, and dial each $\alpha_i$ on the slider beneath. The decision boundary and the per-point accuracy update live. Outlined points are misclassified.
The reason this is possible at all is that the layer’s two parameter families do separable, interpretable work. $w_i$ is where the unit listens; $\alpha_i$ is how its kernel score is read out. Placing prototypes near each class’s mass and turning on the readouts is the same algorithm a kernel SVM would run, performed in your head, and the same algorithm a trained kernel-MLP layer ends up at, performed by gradient descent. The picture you get on the chart is the picture the layer encodes in its weight matrix. Train it instead of clicking it and you’d see the prototypes drift toward the same kind of placement.
This is the affordance the standard MLP doesn’t have. There is no analogous game for ReLU units — placing affine half-planes by hand to separate a moons dataset is possible but unilluminating, because the units don’t tell you what they listen for. The kernel layer’s two-parameter structure makes the construction not just possible but pedagogical: you can see the layer’s job, do its job by hand, check your work, and read out what you did.
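The hand-placement game can be played in code as well. The sketch below uses two Gaussian blobs as a simpler stand-in for a classic 2D dataset, places one prototype on each class by hand, and sets the readouts to opposite signs. The dataset, the prototype positions, and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def yat_scores(X, W, eps=1e-2):
    # Kernel score of every input against every prototype.
    dots = X @ W.T
    sq = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return dots ** 2 / (sq + eps)

rng = np.random.default_rng(0)
# Two well-separated blobs, labelled +1 and -1.
X = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(50, 2)),
               rng.normal([-2.0, 2.0], 0.3, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

W = np.array([[2.0, 2.0], [-2.0, 2.0]])   # prototypes dropped on each class's mass
alpha = np.array([1.0, -1.0])             # readouts: class +1 up, class -1 down

f = yat_scores(X, W) @ alpha              # signed sum of prototype responses
accuracy = float(np.mean(np.sign(f) == y))
```

No gradient step was taken: the two rows of `W` and the two entries of `alpha` were written down by looking at where the classes sit. That is the legibility the section describes.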
The MLP chose to be opaque
The point of the experiment is not that this particular architecture is the right one to scale. The Yat unit is rational where the standard MLP is affine, the optimisation behaves differently, the FLOPs are different, the practical scaling regime is an open question that a M-parameter network does not answer.
The point is structural. The MLP block was illegible not because it was deep, not because high dimensions are opaque, not because activations get distributed across many neurons. It was illegible because its primitive — affine map plus pointwise nonlinearity — does not carry the four objects that make a kernel layer readable. The collapse is mechanical (see Activations Are Bad for Geometry): the resulting layer has no native similarity score, no prototype, no normalised contribution, no induced metric. Everything downstream interpretability has spent the past five years building is an external apparatus to recover the objects the primitive does not supply.
Replace the primitive with one that does supply them and the apparatus becomes superfluous. The same operations attention has enjoyed by construction — name a unit, visualise it, ablate it, slice it by input role — become one-line edits of the layer’s weight matrix, and the resulting claims have the strength attention claims do, because they are statements about the layer’s actual representation rather than about a learned proxy for it.
The MLP block is not inherently opaque. It chose to be, by adopting a primitive that does not carry geometry. The choice is reversible.
Cite as
Bouhsine, T. (2026). What an MLP Knows, When It's a Kernel. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/what-an-mlp-knows/
BibTeX
@misc{bouhsine2026whatanmlpknows,
author = {Bouhsine, Taha},
title = {What an MLP Knows, When It's a Kernel},
year = {2026},
month = {may},
howpublished = {\url{https://tahabouhsine.com/blog/what-an-mlp-knows/}},
note = {Blog post, Records of the !mmortal Data Scientist}
}
For the underlying paper
Bouhsine, T. (2026). Painting Arithmetic: A Rational-Form Network for Visual Symbolic Computation in Latent Space. Supporting experiment.
BibTeX
@unpublished{bouhsine2026paintingarithmetic,
author = {Bouhsine, T.},
title = {Painting Arithmetic: A Rational-Form Network for Visual Symbolic Computation in Latent Space},
year = {2026},
note = {Supporting experiment}
}