Cheap Attention: Linear-Time Kernel Approximation

· 25 min read

#ml#attention#kernels#performer#linear-attention#transformers#random-features#efficient-transformers#rkhs#self-attention

Runnable JAX companionCheap Attention in JAX/Flax NNXPrefer to read the code? This post has a hands-on JAX / Flax NNX implementation.Open the JAX companion

A 128K-token prompt sounds like a lot of text. Inside a transformer, it is also a lot of pairwise bookkeeping: in every layer, every token asks how much it should listen to every other token. Exact softmax attention writes those answers into an N×NN \times N ledger.

That ledger is useful, but it is not sacred. The softmax score is a kernel, which means it is secretly an inner product in a feature space. The reason exact attention looks quadratic is that the exact feature space is infinite-dimensional, so the implementation falls back to evaluating every pair by hand. Give the kernel a finite approximate feature map, move the parentheses, and the ledger disappears.

That is linear attention: keys and values write once into a shared feature-space state, and each query reads from that state. We will build the idea first through a gravity analogy — all-pairs forces versus a shared field — then derive the same bookkeeping change from the softmax kernel itself. Every figure here runs live in your browser.


Start away from transformers for a moment, because gravity makes the bargain obvious. Take NN bodies pulling on one another. The direct way is a ledger: for each body, sum the pull of every other body — N(N1)N(N-1) interactions, quadratic before the simulation has done anything interesting. The field way changes what you store: the bodies write their mass into a shared gravitational potential, you solve for that field once, and each body reads its local pull from it. The per-body work stops growing with the number of other bodies; it grows only with the cost of maintaining one shared object.

That contrast — all pairs versus one shared object, written once and read many times — is the entire idea, and it is the split you just saw: the left column is the ledger, the right column is the field. We are borrowing only that bookkeeping shape; the field method is an approximation, and we are not claiming attention is secretly gravity.

Now replace planets with tokens.

Attention is the operation that defines the transformer, and at long context it is the one that dominates its cost. Imagine asking a model to read a whole codebase, a litigation folder, or a book-length transcript. The hard part is not only storing the words. In every layer, each of the NN tokens attends to all NN others — an N×NN \times N ledger of who should listen to whom, computed separately in every attention head.

That work is quadratic in the sequence length, and quadratics are merciless: double the context and the cost quadruples. A model that handles a 4K prompt without noticing faces a thousandfold more attention arithmetic at 128K, where the score matrix alone is over sixteen billion entries per head, per layer. This O(N2)O(N^2) is the line item behind every context-length limit, every long document that has to be chunked, and every tool that says “too much text” when what you actually wanted was “read the whole thing.”

The plot is the same ledger-versus-field story in transformer units. At short context, the matrix is small enough that exact softmax is hard to beat. At long context, the quadratic curve bends upward until it dominates everything else. The two figures below show the bookkeeping directly — the same split as the gravity row, now in tokens.

Exact softmax is the ledger: it materializes a score for every query-key pair. Linear attention is the field: it first accumulates keys and values into a feature-space state, then lets every query read that state.

The cost also accumulates during generation. Each new token attends back over the history, so every past key and value is cached and re-read at every step. The cache grows with the sequence, and producing a long answer is itself a quadratic amount of work, paid token by token. Longer context does not cost more once; it costs more at every step from then on.

All of this makes the N×NN \times N table look inevitable — the unavoidable price of letting every token consult every other. It is not. The similarity score exp(qk)\exp(q \cdot k) is a kernel, and every positive-semidefinite kernel is secretly an inner product of features, exp(qk)=ϕ(q),ϕ(k)\exp(q \cdot k) = \langle \phi(q), \phi(k)\rangle. The catch — and the kernel-view reason attention is quadratic — is that for the softmax kernel this ϕ\phi is infinite-dimensional: you cannot write it down, so you have no choice but to evaluate ϕ(qi),ϕ(kj)\langle \phi(q_i), \phi(k_j)\rangle for every pair by hand. (The operational reason is simply that exact softmax needs every pair score unless you approximate it or restructure the computation; IO-aware kernels like FlashAttention make the exact matrix cheap to hold but leave the quadratic arithmetic intact.) That pairwise evaluation is the N×NN \times N matrix — not the essence of attention, but the ledger you are forced to keep when the feature map has no finite form.

Replace that exact, infinite feature map with an approximate, finite one — just mm features — and the inner product becomes an ordinary dot product of short vectors. Now plain associativity lets you regroup the sum so the N×NN \times N matrix never forms, and the cost falls from O(N2)O(N^2) to O(N)O(N). Cheap attention is linear attention: a kernel with a finite feature map, evaluated in the right order. That move — approximate the features, then reassociate — has a name: the Performer.

The same structural fact — that a kernel is an inner product of features — pays out twice. In an earlier post it was what made attention readable; here it is what makes attention cheap. You will not need that post to follow this one; we build everything we use from the score up.

From here the argument has four moves:

  1. Show that the softmax score is a kernel with a hidden feature map.
  2. Replace that infinite feature map with a finite random sketch.
  3. Use the sketch to move the parentheses so the N×NN \times N matrix never forms.
  4. Account for what the bargain buys, what it costs, and what design space it opens.

The matrix is a symptom

Start with the smallest object in the story: one token asking what it should read from the rest of the sequence. Token by token, scaled dot-product attention is

yi=jκ(qi,kj)lκ(qi,kl)vj,κ(q,k)=exp ⁣(qk/d).y_i = \sum_j \frac{\kappa(q_i, k_j)}{\sum_{l} \kappa(q_i, k_l)}\, v_j, \qquad \kappa(q, k) = \exp\!\big(q \cdot k / \sqrt{d}\big).

This is Nadaraya–Watson regression with the softmax kernel κ\kappa: each value vjv_j is averaged according to how similar its key kjk_j is to the query qiq_i. Fold the 1/d1/\sqrt d into the projections (write qq/d1/4q \leftarrow q/d^{1/4}, kk/d1/4k \leftarrow k/d^{1/4}) and the kernel is simply κ(q,k)=exp(qk)\kappa(q,k) = \exp(q\cdot k).

Here is the fact that the quadratic implementation hides. The softmax kernel is exactly an inner product of feature maps:

exp(qk)  =  ϕ(q),ϕ(k).\exp(q \cdot k) \;=\; \big\langle \phi(q),\, \phi(k) \big\rangle.

There is a clean way to see what kind of kernel it is. Complete the square,

exp(qk)=exp ⁣(q22+k22)exp ⁣( ⁣qk22),\exp(q\cdot k) = \exp\!\Big(\tfrac{\|q\|^2}{2} + \tfrac{\|k\|^2}{2}\Big)\,\exp\!\Big(\!-\tfrac{\|q-k\|^2}{2}\Big),

so the softmax kernel is a Gaussian (RBF) kernel reweighted by per-vector norms. That matters because the Gaussian kernel is the canonical case where the feature map exists but is infinite-dimensional. The coordinates are real; there are just infinitely many of them. You can reason about ϕ\phi, but you cannot store it as a vector and take a literal dot product.

How infinite is infinite? Mercer’s theorem diagonalizes the kernel as κ(x,y)=nλnen(x)en(y)\kappa(x,y) = \sum_{n} \lambda_n\, e_n(x)\, e_n(y) — a sum over eigenfunctions ene_n weighted by eigenvalues λn0\lambda_n \ge 0 — and the length of ϕ\phi is exactly the count of nonzero λn\lambda_n. For a polynomial kernel that count is finite. For the Gaussian it is not: its eigenfunctions are the Hermite functions, countably many, and every eigenvalue is strictly positive. The spectrum decays, but it never hits zero. There is no final coordinate where the exact feature map stops.

So in practice we never form ϕ\phi. We evaluate κ(qi,kj)\kappa(q_i,k_j) directly for every pair, and that direct evaluation is precisely the N2N^2 score matrix.

The kernel is an inner product in principle. We just cannot afford the exact coordinates. So we pay by hand, pair by pair — and the N×NN \times N score matrix that results is not the essence of attention. It is the bill for doing an infinite-dimensional dot product without the vector.

That diagnosis changes the problem. We no longer need to make pairwise comparison faster; we need to avoid pairwise comparison by giving the hidden feature map usable coordinates.

Sketch the hidden features

Set attention aside for a section. This trick is older and more general than transformers. Any positive-semidefinite kernel K(x,y)K(x,y) is an inner product of features — a map ϕ\phi into some, possibly infinite-dimensional, space with K(x,y)=ϕ(x),ϕ(y)K(x,y) = \langle \phi(x), \phi(y)\rangle. “Similarity” is literally “alignment after a feature lift.” For the Gaussian kernel K(x,y)=exp(xy2/2σ2)K(x,y) = \exp(-\|x-y\|^2/2\sigma^2) that lifted space is infinite-dimensional, but a finite random sketch already shows the structure.

The escape route is to stop asking for the whole feature map. Rahimi and Recht (2007) observed that many kernels can be written as an expectation of a simple random feature, K(x,y)=Ew[ψw(x)ψw(y)]K(x,y) = \mathbb{E}_w[\psi_w(x)\,\psi_w(y)], so a Monte-Carlo average over mm draws,

K^(x,y)=1mi=1mψwi(x)ψwi(y)=ϕ^(x),ϕ^(y),ϕ^(x)=1m[ψw1(x),,ψwm(x)],\hat K(x,y) = \frac1m \sum_{i=1}^m \psi_{w_i}(x)\,\psi_{w_i}(y) = \big\langle \hat\phi(x), \hat\phi(y)\big\rangle, \qquad \hat\phi(x) = \tfrac1{\sqrt m}\big[\psi_{w_1}(x),\dots,\psi_{w_m}(x)\big],

is an unbiased, finite-dimensional estimate of the kernel with variance 1/m\sim 1/m. Those mm random features are the explicit feature vector you could not write down exactly.

The intuition is mundane: instead of measuring the whole infinite object, throw a few random probes at it and average what they report. A city planner does not need a thermometer on every street corner to estimate the temperature field; enough sensors sketch the field. Random features do the same thing for a kernel. They trade an exact infinite comparison for a finite sketch whose error shrinks as you add probes.

For the Gaussian kernel the recipe has a closed form, due to Bochner: a shift-invariant kernel is the Fourier transform of a non-negative spectral density, so K(δ)=Ewp[cos(wδ)]K(\delta) = \mathbb{E}_{w\sim p}[\cos(w\,\delta)] with the frequencies ww drawn from that density (a Gaussian, for the RBF). Each random feature is one cosine wave at one random frequency. Average enough waves and the smooth bell curve assembles itself.

This has an immediate consequence for cost. Evaluating a kernel on nn points usually means building the n×nn\times n Gram matrix Kij=K(xi,xj)K_{ij} = K(x_i,x_j) — the O(n2)O(n^2) object at the heart of every kernel method. Replace the kernel by mm explicit features and that matrix becomes KΦΦK \approx \Phi\Phi^\top with ΦRn×m\Phi \in \mathbb{R}^{n\times m}: a rank-mm factorization you need never expand.

The Gram matrix is the full ledger again: every point compared against every other point. The feature matrix Φ\Phi is a compressed dossier: each point gets mm coordinates, and any relationship can be reconstructed by taking a dot product. If you only need sums of relationships, expanding the whole chart is wasted motion.

That factorization is the lever. Anywhere a computation sums a kernel over many points, explicit features let you regroup the sum and skip the full matrix. Attention is exactly such a computation. But before we spend the sketch, there is one attention-specific constraint: the sketch has to survive normalization.

The sketch must stay positive

The softmax kernel is a Gaussian kernel up to the per-vector norm factors we completed the square on, so the same random-feature recipe applies. But attention is less forgiving than a generic kernel method. The estimate is not just a number in a loss; it becomes a weight in a normalized average. If that weight is negative, noisy, or occasionally enormous in the wrong place, the whole layer feels it.

That is why the Performer’s FAVOR+ uses positive random features. Draw w1,,wmN(0,Id)w_1,\dots,w_m \sim \mathcal N(0, I_d) and set

ϕ+(x)=exp(x2/2)m[exp(w1x),,exp(wmx)].\phi^{+}(x) = \frac{\exp(-\|x\|^2/2)}{\sqrt m}\,\Big[\exp(w_1^\top x),\,\dots,\,\exp(w_m^\top x)\Big].

A one-line Gaussian-integral computation gives E[ϕ+(q)ϕ+(k)]=exp(qk)\mathbb{E}\big[\phi^{+}(q)\cdot\phi^{+}(k)\big] = \exp(q\cdot k) exactly, so the features are unbiased for the softmax kernel. But the more important property is visible without the integral: every coordinate of ϕ+\phi^{+} is positive. The approximation may be noisy, but it cannot claim that a token has negative attention mass.

There is an older, more obvious choice that also works in expectation and yet fails in practice. The Gaussian factor invites the random Fourier features from the section above — the trigonometric map ϕtrig(x)exp(x2/2)[sin(wix),cos(wix)]\phi^{\text{trig}}(x) \propto \exp(\|x\|^2/2)\,[\sin(w_i^\top x),\cos(w_i^\top x)], the same cosines that built the bell. These are unbiased too. On paper, that sounds enough. Inside attention, it is not.

The positive-feature estimate stays close to the true kernel and, crucially, never goes negative. The trigonometric estimate is unbiased but its variance explodes precisely where the kernel is small — the far tail, the token pairs a good softmax wants to ignore. There it produces negative kernel values, which are nonsensical as attention weights, and it can drive the per-row normalizer toward zero, which makes the whole layer blow up.

This is the difference between correctness in expectation and correctness in a layer you actually train. It is like building a probability table where some entries say “minus three percent likely” and then asking the table to normalize itself. The contribution of the Performer is, in large part, the realization that for a quantity that lives inside a softmax, you want an estimator that is positive and low-variance in the tail, not merely unbiased. Positivity is not a technicality; it is the difference between a stable layer and a divergent one.

Now the ingredients are in place: a finite feature map, a stable positive estimator, and an attention computation that only needs sums over keys and values. We can finally make the matrix disappear.

Spend the sketch

Now spend the approximation. Replace the exact kernel with the inner product of features, exp(qikj)ϕ(qi)ϕ(kj)\exp(q_i \cdot k_j) \approx \phi(q_i) \cdot \phi(k_j), and look at what happens to a single output. This is the moment where the story changes from “compare everything with everything” to “summarize the keys and values once, then let each query read from the summary”:

yi    j(ϕ(qi)ϕ(kj))vjjϕ(qi)ϕ(kj)  =  ϕ(qi)(jϕ(kj)vj)ϕ(qi)(jϕ(kj)).y_i \;\approx\; \frac{\sum_j \big(\phi(q_i)^\top \phi(k_j)\big)\, v_j^\top}{\sum_j \phi(q_i)^\top \phi(k_j)} \;=\; \frac{\phi(q_i)^\top \Big(\sum_j \phi(k_j)\, v_j^\top\Big)}{\phi(q_i)^\top \Big(\sum_j \phi(k_j)\Big)}.

The whole trick is the parenthesis. Because ϕ(qi)\phi(q_i) does not depend on the summation index jj, it pulls out of the sum, and what remains inside —

S=jϕ(kj)vjRm×d,z=jϕ(kj)RmS = \sum_j \phi(k_j)\, v_j^\top \in \mathbb{R}^{m \times d}, \qquad z = \sum_j \phi(k_j) \in \mathbb{R}^{m}

does not depend on ii at all. The sequence has been compressed into two feature-space summaries: one carrying values, one carrying the normalizer. Compute SS and zz once, in a single O(Nmd)O(Nmd) pass over the keys and values, then read out every query as yi=ϕ(qi)S/ϕ(qi)zy_i = \phi(q_i)^\top S \,/\, \phi(q_i)^\top z. In matrix form the whole layer is

Y^=D1ϕ(Q)(ϕ(K)V),D=diag ⁣(ϕ(Q)ϕ(K)1),\widehat{Y} = D^{-1}\,\phi(Q)\,\big(\phi(K)^\top V\big), \qquad D = \mathrm{diag}\!\big(\phi(Q)\,\phi(K)^\top \mathbf{1}\big),

and the parenthesization is the entire point: compute ϕ(K)V\phi(K)^\top V first (an m×dm \times d matrix), and the N×NN \times N object never forms. Softmax forces (QK)V(QK^\top)V, with the N×NN\times N matrix in the middle; random features permit ϕ(Q)(ϕ(K)V)\phi(Q)(\phi(K)^\top V), with only m×dm \times d in the middle. Same associative product, different order of operations, quadratically different cost.

And there is the whole story in one line: it is the same matrix you would have built — just never built. The N×NN \times N object was not the computation you wanted; it was the expanded ledger you were forced to keep because ϕ\phi had no finite form. Give ϕ\phi coordinates and the same sum reassociates around it. Instead of writing down every pairwise debt and adding the columns later, you keep running totals in feature space from the start. The viz below forms the forbidden matrix anyway, on both sides, so you can watch the approximation close.

For autoregressive models the same factorization survives causal masking, and arguably gets prettier. The causal output only sums over jij \le i, so define running prefix sums Si=jiϕ(kj)vjS_i = \sum_{j\le i}\phi(k_j)v_j^\top and zi=jiϕ(kj)z_i = \sum_{j\le i}\phi(k_j); then yi=ϕ(qi)Si/ϕ(qi)ziy_i = \phi(q_i)^\top S_i / \phi(q_i)^\top z_i, a linear recurrence in ii.

A causal linear-attention layer is literally a recurrent network with an m×dm \times d state, processed in one streaming pass with no KV cache that grows with the sequence length (the state size is fixed at m×dm \times d). Instead of carrying the whole conversation as a growing pile of keys and values, the model carries a running summary written in feature coordinates. This is the linear-transformers-are-RNNs observation, and it falls straight out of the kernel factorization.

The long-context payoff

Quadratic versus linear is an asymptotic story, and asymptotic stories can hide bad constants. The opening plot showed the curve shape; the practical question is not “which curve wins eventually?” It is “when does eventually begin?”

Two honest caveats the curves make obvious. First, there is a crossover: random features only win once NN is comfortably past m\sim m. For short sequences the mm features are pure overhead and plain softmax is faster. That is why production stacks reach for exact-but-IO-aware softmax at moderate lengths and save linearization for genuinely long context.

Second, the quieter axis is memory. The N2N^2 that hurts most in practice is often not the FLOPs but the score matrix you must hold while normalizing. Replacing it with an m×dm \times d running state is what makes very long contexts fit at all.

What you give up

None of this is free, and the honest accounting is short. Random features are an approximation: each kernel entry is unbiased, but the normalized output is a ratio of two such estimates, so it carries variance — and a small bias — that only shrinks as mm grows.

At small mm this shows up as degraded quality, most visibly on tasks that need sharp, near-one-hot attention: retrieval, induction, copying, exact lookup. In those cases the exact attention matrix is effectively high-rank — close to a permutation — and a rank-mm feature average simply cannot reproduce it. A lookup table wants the exact seat number; a sketch of the room is not enough. Positive features and orthogonality tame the variance but do not abolish it. And the crossover is real: below it you are paying for features you do not need.

This is why softmax did not disappear. Exact softmax is still the right tool when sequences are short enough, when IO-aware kernels make the matrix affordable, or when the task needs the sharpness of a nearly one-hot lookup. Linear attention is the long-context bargain: give up exact pairwise comparison, get a state whose size does not grow like N2N^2.

The design space opens

The bargain also reframes what softmax is. It is not the only way to do attention. It is one point — the exact, infinite-feature point — in a space of kernels, every other point of which can be made linear by naming its features.

That means a feature map is not just an implementation detail. It decides what the model means by “near,” “aligned,” and “worth attending to.” Change the feature map and you change the geometry the layer sees. One way to make that visible is to fix a handful of key vectors and color each point of the plane by which key wins its attention. The result is a map of the kernel’s territory: which regions belong to which keys, and what kind of boundaries the similarity function draws.

Two things are worth pulling out of this picture. First, FAVOR+ is genuinely an approximation of softmax: its partition converges to the exact dot-product cones as you add features. That is the sense in which a Performer is “doing attention” rather than merely doing something attention-shaped.

Second, not every linear attention is a softmax approximation. The elu()+1\text{elu}(\cdot)+1 map of Katharopoulos et al. (2020) is a perfectly good, cheap feature map, but it defines its own kernel with its own geometry. It is not trying to imitate softmax; it is choosing a different rulebook.

Step back and the unifying statement is almost embarrassingly simple. Every linear-attention method is a choice of feature map ϕ\phi, plugged into the same reassociated product. The papers differ less in the algebra than in the geometry they choose:

Once attention becomes “pick a feature map and reassociate,” softmax is one option in a design space, not the destination. Softmax’s cones make similarity purely angular; other kernels can be built to decay with distance, to prefer locality, or to expose a structure you want the model to use. The quadratic cost was never intrinsic to attention. It was the price of evaluating an infinite feature map pairwise instead of approximating it and reassociating.

The companion post argued that attention is explainable because it is a kernel. The same sentence, read for cost, says attention is linearizable because it is a kernel. A kernel is an inner product of features. Inner products factor. The long-context trick is to stop worshiping the matrix and start choosing the feature map.


References inline. The softmax-kernel feature-map view and FAVOR+ are from Choromanski et al. (2021); the linear-transformer/RNN factorization from Katharopoulos et al. (2020); random features from Rahimi & Recht (2007); the kernel-smoother reading of attention from Tsai et al. (2019); cosFormer from Qin et al. (2022). IO-aware exact attention is FlashAttention (Dao et al., 2022).

Cite as

Bouhsine, T. (). Cheap Attention: Linear-Time Kernel Approximation. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/cheap-attention-is-linear-attention/

BibTeX
@misc{bouhsine2026cheapattentionislinearattention,
  author       = {Bouhsine, Taha},
  title        = {Cheap Attention: Linear-Time Kernel Approximation},
  year         = {2026},
  month        = {may},
  howpublished = {\url{https://tahabouhsine.com/blog/cheap-attention-is-linear-attention/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

References

  1. Rahimi, A., Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS 2007.
  2. Tsai, Y.-H. H., et al. (2019). Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel. EMNLP-IJCNLP 2019.
  3. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.arXiv:2006.16236
  4. Choromanski, K., et al. (2021). Rethinking Attention with Performers. ICLR 2021.arXiv:2009.14794
  5. Qin, Z., et al. (2022). cosFormer: Rethinking Softmax in Attention. ICLR 2022.arXiv:2202.08791
  6. Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.arXiv:2205.14135