RKHS
Reproducing kernel Hilbert spaces in deep learning: feature maps, the softmax kernel, positive-definite vs nonnegative kernels, and Mercer theory.
-
Cheap Attention: Linear-Time Kernel Approximation
A 128K-token context creates billions of pairwise questions per attention head. But the N×N matrix is not the essence of attention; it is the receipt for an infinite feature map we never wrote down. Approximate that feature map with random features, reassociate the sum, and softmax attention becomes linear-time kernel attention. The whole argument is built from live in-browser visualizations.
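The reassociation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's implementation: it assumes positive random features of the form exp(w·x − |x|²/2) with Gaussian projections, and the function names (`positive_random_features`, `feature_attention`, `linear_attention`), the feature count, and the seed are all hypothetical choices made for the example.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Exact attention: materializes the full (n, n) weight matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return (weights @ v) / weights.sum(axis=-1, keepdims=True)

def positive_random_features(x, w):
    """phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m), so E[phi(q).phi(k)] = exp(q.k)."""
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(w.shape[0])

def feature_attention(phi_q, phi_k, v):
    """Reassociated sum: phi_q @ (phi_k.T @ v), never forming an (n, n) matrix."""
    kv = phi_k.T @ v                 # (m, d_v), one pass over the keys
    normalizer = phi_k.sum(axis=0)   # (m,), plays the role of the softmax denominator
    return (phi_q @ kv) / (phi_q @ normalizer)[:, None]

def linear_attention(q, k, v, num_features=4096, seed=0):
    """Approximate softmax attention using random features (assumed settings)."""
    d = q.shape[-1]
    w = np.random.default_rng(seed).standard_normal((num_features, d))
    scale = d ** -0.25               # splits the 1/sqrt(d) temperature between q and k
    phi_q = positive_random_features(q * scale, w)
    phi_k = positive_random_features(k * scale, w)
    return feature_attention(phi_q, phi_k, v)
```

With n tokens, m random features, and head dimension d, the exact version costs on the order of n²·d operations, while the reassociated version costs on the order of n·m·d. The approximation error depends on m and on the norms of the queries and keys, so the sketch is only expected to track exact attention closely for moderate input scales.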
-
Attention is Explainable Because it is a Kernel
Self-attention in transformers is a Nadaraya–Watson kernel smoother. That fact — and not "we visualize the matrix" — is why attention heads are readable while MLPs are not.
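The identity can be checked numerically. The sketch below is an illustration under stated assumptions rather than code from the article: it takes the kernel to be exp(q·k/√d), and the names `nadaraya_watson` and `exp_kernel` are hypothetical. It computes a generic Nadaraya–Watson estimate (a kernel-weighted average of the values) and compares it with standard scaled dot-product softmax attention.

```python
import numpy as np

def exp_kernel(q, k):
    """Assumed kernel: exponential of the scaled dot product."""
    return np.exp(q @ k / np.sqrt(q.shape[-1]))

def nadaraya_watson(queries, keys, values, kernel):
    """Kernel smoother: each output is a kernel-weighted average of the values."""
    weights = np.array([[kernel(q, k) for k in keys] for q in queries])
    return (weights @ values) / weights.sum(axis=1, keepdims=True)

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention with a row-wise softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return (weights @ v) / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4))
k = rng.standard_normal((7, 4))
v = rng.standard_normal((7, 3))

smoothed = nadaraya_watson(q, k, v, exp_kernel)
attended = softmax_attention(q, k, v)
same = np.allclose(smoothed, attended)  # the two agree up to floating-point error
```

The two functions differ only in how the weights are written down: the softmax row normalization is exactly the Nadaraya–Watson denominator, which is why each attention weight can be read as a normalized kernel similarity between a query and a key.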