Residual Stream

3 long-form posts on Residual Stream: machine-learning research by Taha Bouhsine, each built around live, in-browser interactive visualizations.

3 posts tagged #residual-stream.

Jul 9, 2026

A Velocity Ledger for Transformers, in JAX/Flax NNX

A runnable companion: the pre-norm Transformer block as a forward-Euler step, then the residual-stream velocity ledger as one line of Flax NNX state (mu = 0 recovers the plain block), the ngpt-lite retraction variant, training early-stopped at the best validation loss, and the depth telemetry (path length and turning angle per sub-update). Four parameter-matched char-level GPTs tie on quality and split on dynamics: the ledger's residual-stream path is a third as long and half as sharp.
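
To make the "mu = 0 recovers plain" remark concrete, here is a minimal sketch in plain Python rather than the post's Flax NNX. It assumes the ledger takes a heavy-ball form, v <- mu * v + f(x) followed by x <- x + v, which is an assumption and may differ from the post's exact update. `toy_sublayer`, `run_ledger`, and `run_plain` are hypothetical names used only for illustration.

```python
def toy_sublayer(x):
    # Hypothetical stand-in for Attn(norm(x)) or MLP(norm(x)).
    return [-0.1 * xi for xi in x]


def euler_step(x, f):
    """Plain residual update, i.e. one forward-Euler step: x <- x + f(x)."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def ledger_step(x, v, f, mu):
    """Assumed heavy-ball ledger: v <- mu * v + f(x), then x <- x + v."""
    v = [mu * vi + fi for vi, fi in zip(v, f(x))]
    x = [xi + vi for xi, vi in zip(x, v)]
    return x, v


def run_plain(x0, depth):
    x = list(x0)
    for _ in range(depth):
        x = euler_step(x, toy_sublayer)
    return x


def run_ledger(x0, depth, mu):
    # The velocity starts at zero and is carried across sub-updates.
    x, v = list(x0), [0.0] * len(x0)
    for _ in range(depth):
        x, v = ledger_step(x, v, toy_sublayer, mu)
    return x
```

Under this assumed form, `run_ledger(x0, depth, mu=0.0)` returns the same values as `run_plain(x0, depth)`, since the velocity then equals the current sub-update and nothing is carried over.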

Jul 9, 2026

Transformers With a Velocity Ledger

A pre-norm Transformer's residual stream is forward Euler: x += Attn(norm x); x += MLP(norm x). So D1's whole dictionary transfers, and the same question follows: does a velocity ledger in the residual stream buy in a Transformer what it bought in a ResNet? The answer splits. On quality, four variants tie. On dynamics, the ledger changes everything: the residual-stream path through depth gets dramatically shorter and straighter, reaching the same answer by a calmer journey. Same destination, gentler road.
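
One way to quantify "shorter and straighter" is sketched below in plain Python. The definitions are assumptions, not necessarily the post's: path length is taken as the sum of the Euclidean norms of successive residual-stream updates, and the turning angle as the angle between consecutive updates. `path_telemetry` is a hypothetical helper name.

```python
import math


def path_telemetry(states):
    """Return (path_length, turning_angles_in_degrees) for a trajectory.

    `states` is a list of residual-stream vectors, one per sub-update
    boundary (input, after attention, after MLP, and so on).
    """
    # Displacement contributed by each sub-update.
    deltas = [[b - a for a, b in zip(s0, s1)]
              for s0, s1 in zip(states, states[1:])]
    norms = [math.sqrt(sum(d * d for d in delta)) for delta in deltas]
    path_length = sum(norms)

    angles = []
    for i in range(len(deltas) - 1):
        n0, n1 = norms[i], norms[i + 1]
        if n0 == 0.0 or n1 == 0.0:
            continue  # the angle is undefined for a zero-length update
        dot = sum(a * b for a, b in zip(deltas[i], deltas[i + 1]))
        cos = max(-1.0, min(1.0, dot / (n0 * n1)))  # guard against rounding
        angles.append(math.degrees(math.acos(cos)))
    return path_length, angles
```

For example, the trajectory `[[0, 0], [1, 0], [1, 1]]` has path length 2 with a single 90-degree turn, while `[[0, 0], [1, 0], [2, 0]]` has the same length and a 0-degree turn.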

Jun 4, 2026

The Readout is a Convex Combination of Prototypes

The second linear map in a transformer MLP is not just a projection. If the hidden activations are nonnegative and normalized, W_out reads the active neurons as a convex combination of output prototypes. Two independent constraints, nonnegativity and summing to one, sort the readout into four regimes: convex, conic, affine, and linear. This reframes the MLP readout as the same object that makes attention legible (a weighted sum over named basis elements), connects it to feed-forward key-value memories and modern Hopfield retrieval, and shows when a kernel makes it convex by construction.
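
The four regimes can be illustrated with a small plain-Python sketch. It treats the rows of a toy prototype matrix as the output prototypes and the hidden activations as weights. `regime`, `readout`, and the use of softmax as the normalizer are illustrative assumptions, not the post's construction.

```python
import math


def regime(weights, tol=1e-9):
    """Classify a weight vector by the two constraints from the summary."""
    nonneg = all(w >= -tol for w in weights)
    sums_to_one = abs(sum(weights) - 1.0) <= tol
    if nonneg and sums_to_one:
        return "convex"
    if nonneg:
        return "conic"
    if sums_to_one:
        return "affine"
    return "linear"


def readout(weights, prototypes):
    """Weighted sum of prototype rows: sum_i weights[i] * prototypes[i]."""
    dim = len(prototypes[0])
    return [sum(w * p[j] for w, p in zip(weights, prototypes))
            for j in range(dim)]


def softmax(z):
    # One assumed way to obtain nonnegative weights that sum to one.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]
```

With prototypes `[[0.0, 4.0], [4.0, 0.0]]` and weights `[0.25, 0.75]`, the readout is `[3.0, 1.0]`, a point on the segment between the two prototypes. Weights such as `[1.5, -0.5]` still sum to one but leave that segment, which is the affine regime.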