Linear Attention
Linear attention and Performers: approximating the softmax kernel with random features so the N×N attention matrix is never formed.
-
Cheap Attention in JAX/Flax NNX
A runnable companion to Cheap Attention: implement positive-feature linear attention in JAX and Flax NNX, watch the all-pairs ledger turn into a shared feature state, and see exactly where the N×N matrix disappears.
-
Cheap Attention: Linear-Time Kernel Approximation
A 128K-token context creates billions of pairwise questions per attention head. But the N×N matrix is not the essence of attention; it is the receipt for an infinite-dimensional feature map we never wrote down. Approximate that feature map with random features, reassociate the sum, and softmax attention becomes linear-time kernel attention. The whole argument is built from live in-browser visualizations.
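
As a rough illustration of the reassociation described above, here is a minimal JAX sketch of non-causal linear attention with positive random features. It is not the code from the companion piece: the function names (`positive_features`, `linear_attention`, `softmax_attention`) are hypothetical, the projection matrix `omega` is assumed to have i.i.d. standard Gaussian rows, and the max-subtraction usually added for numerical stability is left out to keep the idea visible.

```python
import jax
import jax.numpy as jnp


def positive_features(x, omega):
    """Positive random features for the kernel exp(q . k).

    x:     (N, d) queries or keys
    omega: (m, d) random projections, rows assumed drawn from N(0, I)
    With phi(x) = exp(omega x - |x|^2 / 2) / sqrt(m), the dot product
    phi(q) . phi(k) is an unbiased estimate of exp(q . k).
    """
    m = omega.shape[0]
    proj = x @ omega.T                                        # (N, m)
    half_sq_norm = 0.5 * jnp.sum(x**2, axis=-1, keepdims=True)  # (N, 1)
    return jnp.exp(proj - half_sq_norm) / jnp.sqrt(m)


def linear_attention(q, k, v, omega):
    """Non-causal kernel attention that never builds the N x N matrix."""
    d = q.shape[-1]
    scale = d ** -0.25            # splits the usual 1/sqrt(d) between q and k
    phi_q = positive_features(q * scale, omega)   # (N, m)
    phi_k = positive_features(k * scale, omega)   # (N, m)

    # Reassociate: (phi_q phi_k^T) v  ->  phi_q (phi_k^T v).
    kv = phi_k.T @ v              # (m, d_v) shared feature state
    z = phi_k.sum(axis=0)         # (m,)     shared normalizer state
    numerator = phi_q @ kv        # (N, d_v)
    denominator = phi_q @ z       # (N,)
    return numerator / denominator[:, None]


def softmax_attention(q, k, v):
    """Reference quadratic attention, for comparison only."""
    d = q.shape[-1]
    weights = jax.nn.softmax(q @ k.T / jnp.sqrt(d), axis=-1)  # (N, N)
    return weights @ v
```

The largest intermediates in `linear_attention` have shape (N, m) and (m, d_v), so for a fixed number of features m the cost grows linearly in N. The quality of the approximation depends on m and on the norms of the queries and keys; comparing against `softmax_attention` on small inputs is a reasonable sanity check.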