Mechanistic Interpretability

Mechanistic interpretability: induction heads, the QK and OV circuits, feed-forward key-value memories, and the bilinear forms that route attention.

3 posts tagged #mechanistic-interpretability.

Jun 4, 2026

The Readout is a Convex Combination of Prototypes

The second linear map in a transformer MLP is not just a projection. If the hidden activations are nonnegative and normalized, W_out reads the active neurons as a convex combination of output prototypes. Two independent constraints, nonnegativity and summing to one, sort the readout into four regimes: convex, conic, affine, and linear. This reframes the MLP readout as the same object that makes attention legible (a weighted sum over named basis elements), connects it to feed-forward key-value memories and modern Hopfield retrieval, and shows when a kernel makes it convex by construction.

May 21, 2026

What an MLP Knows, When It's a Kernel

The transformer MLP is illegible because its primitive does not carry a kernel. Give it one and the four objects that make attention legible follow for free, for the whole network.

May 14, 2026

Attention is Explainable Because it is a Kernel

Self-attention in transformers is a Nadaraya–Watson kernel smoother. That fact, and not "we visualize the matrix", is why attention heads are readable while MLPs are not.