RKHS

Reproducing kernel Hilbert spaces in deep learning: feature maps, the softmax kernel, positive-definite vs nonnegative kernels, and Mercer theory.

14 posts tagged #rkhs.

Jul 9, 2026

Solving It and Descending It, in JAX/Flax NNX

A runnable companion to the solve-vs-descend post: the Yat kernel and its Gram matrix, the exact kernel ridge solve via Cholesky, the same kernel as a Flax NNX module trained by AdamW with LR sweeps and best-epoch selection, the measured timing wall, minibatching through 511k rows, and the conv trunk the solve can never train. Every number is from the real Kaggle runs.
Jul 9, 2026

One Kernel, Fitted Twice

Kernel methods gave us the theory everyone still wants back, and the field abandoned them over one procedure: the O(n³) solve over an n by n Gram matrix, which cannot minibatch, cannot scale, and cannot sit under other layers. So we took one Mercer kernel and fitted it twice: once by the classical exact solve, once by plain gradient descent on a bank of prototypes. The two machines agree to a correlation of 0.95, and then the descended one walks through three walls the solved one dies at: a measured memory wall at sixteen thousand rows, a half-million-row dataset the solve cannot touch, and an end-to-end network the solve cannot be.
Jul 2, 2026

The Price List, in JAX/Flax NNX

A runnable companion to the price-list post: kernel ridge in JAX, the representer solve (K + lambda I) alpha = y, the RKHS-norm bill alpha^T K alpha, the effective dimension d_eff = sum lambda_k/(lambda_k + lambda) from the Gram spectrum, and a generalization sweep that draws the U-curve. Every number and every figure comes from one analytic solve, with no gradient descent.
Jul 2, 2026

Why Regularization Is a Price List

The representer theorem says the optimal weight is a sum over prototypes, but it does not explain why that sum generalizes. The answer is the RKHS norm: a price list that charges each prototype by its eigenvalue, and regularization is just tightening the budget. Four panels show the knob turning.
Jun 26, 2026

What a Weight Can Be, in JAX/Flax NNX

A runnable companion: compute the price list of a kernel in JAX. The eigenvalues are the kernel's spectral density, found with an FFT; the RKHS norm of a weight is a sum over them. A corner is affordable only under a Sobolev kernel, and the same numbers place the Yat kernel: universal and smooth, roomier than a Gaussian but not a Sobolev space.
Jun 26, 2026

What Can a Weight Be?

Once a kernel gives a weight a home, a second question follows: what is the weight allowed to be? Not all reproducing kernel Hilbert spaces are the same. A Sobolev space lets the weight have a sharp corner; a Gaussian's space forbids it; on normalized data the home is a sphere graded by spherical harmonics. A kernel is secretly a price list for roughness, and that list decides everything. Four interactive panels.
Jun 25, 2026

Where a Weight Lives, in JAX/Flax NNX

A runnable companion: build the representer-theorem weight in JAX. A positive-definite kernel, the Gram matrix, a single linear solve for the coefficients, and the weight comes out as a combination of the data, f = sum alpha_i k(x_i, .). A linear weight cannot separate nested rings; the placed kernel weight does, read purely through the kernel as a similarity-weighted vote of the data.
Jun 25, 2026

Where Does a Weight Live?

A standard neuron's weight and its input never actually meet: one is a point you can see, the other an arrow off in its own space, joined only by a shadow. This is what a reproducing kernel Hilbert space fixes: it gives input and weight one shared address, where the optimal weight is built from the data itself and sits right next to it. Four interactive panels.
Jun 21, 2026

Your Neuron Is a Direction. It Should Be a Picture.

Why should a neuron store a direction when it could store a thing? A direction is not a referent you can point at, which is why MLPs are opaque. Put the Yat kernel where the activation was, train on Fashion-MNIST, and every neuron becomes a prototype that lives in pixel space, literally a picture, so the network reads its own predictions: this looks like that, no saliency method required.
Jun 18, 2026

The Yat-Kernel MLP in JAX/Flax NNX

A runnable companion to What a Finite Kernel Buys an MLP: build a layer whose unit is the Yat kernel instead of a linear map plus an activation, assert it is positive definite and nonnegative, write down its exact finite feature map, train it end-to-end on two moons with no activation function, and measure the lazy-loading sparsity, the bounded off-distribution response, the RKHS capacity, and the force field that pulls each prototype onto its data.
Jun 18, 2026

What a Finite Kernel Buys an MLP

Replace the activation function with a finite, explicit, positive-definite kernel, the Yat kernel, and an MLP stops being a stack of linear maps glued by a nonlinearity. It becomes a kernel machine, with locality, attribution, geometry, capacity control, and a feature map you can write down.
May 31, 2026

Cheap Attention: Linear-Time Kernel Approximation

A 128K-token context creates billions of pairwise questions per attention head. But the N×N matrix is not the essence of attention; it is the receipt for an infinite feature map we never wrote down. Approximate that feature map with random features, reassociate the sum, and softmax attention becomes linear-time kernel attention. The whole argument is built from live in-browser visualizations.
May 14, 2026

Self-Attention as Kernel Regression in JAX/Flax NNX

A runnable companion to Attention is Explainable Because it is a Kernel: build scaled dot-product attention from scratch in Flax NNX, verify in code that it is exactly a Nadaraya–Watson kernel smoother, watch the separate q/k projections break positive-definiteness numerically, swap the exp-dot-product kernel for Gaussian, Yat, and linear kernels to see which keep the weights a convex partition of unity, read the temperature as a kernel bandwidth, and train a single head end-to-end to route to a marked token.
May 14, 2026

Attention is Explainable Because it is a Kernel

Self-attention in transformers is a Nadaraya–Watson kernel smoother. That fact, and not "we visualize the matrix", is why attention heads are readable while MLPs are not.