Linear Attention

Linear attention and Performers: approximating the softmax kernel with random features so the N×N attention matrix never forms.

2 posts tagged #linear-attention.

May 31, 2026

Cheap Attention in JAX/Flax NNX

A runnable companion to Cheap Attention: implement positive-feature linear attention in JAX and Flax NNX, watch the all-pairs ledger turn into a shared feature state, and see exactly where the N×N matrix disappears.

May 31, 2026

Cheap Attention: Linear-Time Kernel Approximation

A 128K-token context creates about 17 billion pairwise questions per attention head. But the N×N matrix is not the essence of attention; it is the receipt for an infinite feature map we never wrote down. Approximate that feature map with random features, reassociate the sum, and softmax attention becomes linear-time kernel attention. The whole argument is built from live in-browser visualizations.
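
The reassociation described above can be sketched in a few lines of JAX. This is a minimal illustration and not the code from either post: the function names (`positive_features`, `linear_attention`), the sizes `N`, `d`, `m`, and the use of plain i.i.d. Gaussian projections are all assumptions made for the example. It uses the positive random features exp(w·x − ‖x‖²/2)/√m, whose inner products approximate exp(q·k) in expectation, and it omits the numerical stabilization a real implementation would likely need.

```python
import jax
import jax.numpy as jnp

def positive_features(x, w):
    # x: (N, d) inputs, w: (m, d) random Gaussian projections (assumed i.i.d.).
    # Returns (N, m) strictly positive features whose dot products
    # approximate exp(x . y) in expectation.
    m = w.shape[0]
    half_sq_norm = 0.5 * jnp.sum(x ** 2, axis=-1, keepdims=True)
    return jnp.exp(x @ w.T - half_sq_norm) / jnp.sqrt(m)

def linear_attention(q, k, v, w):
    # Scale q and k by d^(-1/4) each so that q . k picks up the usual 1/sqrt(d).
    d = q.shape[-1]
    scale = d ** -0.25
    q_f = positive_features(q * scale, w)   # (N, m)
    k_f = positive_features(k * scale, w)   # (N, m)
    kv = k_f.T @ v                          # (m, d_v): shared feature state
    z = k_f.sum(axis=0)                     # (m,): normalizer state
    # Reassociated product: q_f @ (k_f.T @ v) instead of (q_f @ k_f.T) @ v,
    # so no (N, N) matrix is ever materialized.
    return (q_f @ kv) / (q_f @ z)[:, None]

def softmax_attention(q, k, v):
    # Exact reference that does build the (N, N) matrix.
    d = q.shape[-1]
    return jax.nn.softmax(q @ k.T / jnp.sqrt(d), axis=-1) @ v

# Illustrative sizes only; small inputs keep the estimator's variance low.
N, d, m = 64, 8, 4096
kq, kk, kval, kw = jax.random.split(jax.random.PRNGKey(0), 4)
q = 0.5 * jax.random.normal(kq, (N, d))
k = 0.5 * jax.random.normal(kk, (N, d))
v = jax.random.normal(kval, (N, d))
w = jax.random.normal(kw, (m, d))

approx = linear_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
max_err = float(jnp.max(jnp.abs(approx - exact)))
```

The cost of the approximate path grows as N·m·d rather than N²·d, which is where the linear-time claim comes from. The approximation error depends on the number of random features `m` and on the norms of the queries and keys, so `max_err` is worth inspecting for any setting other than the small one assumed here.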