What a Finite Kernel Buys an MLP

June 18, 2026 · 17 min read

#ml #kernels #interpretability #mlp #rkhs #yat #geometry #deep-learning

Runnable JAX companion: The Yat-Kernel MLP in JAX/Flax NNX. Prefer to read the code? This post has a hands-on JAX / Flax NNX implementation.

Every multilayer perceptron is built from one primitive: a linear map followed by a pointwise nonlinearity, $h = \sigma(w^\top x + b)$ . The activation $\sigma$ is the part nobody can quite justify. We reach for ReLU because it trains, GELU because it trains a little better, and we tell ourselves the choice is a detail. It is not a detail. A pointwise activation is a patch over a missing object, and the price of the patch is everything that makes an MLP opaque: no geometry, no centers, no attribution, no capacity you can name.

This post asks a single question. What if the primitive were not a linear map plus an activation, but a finite, explicit, positive-definite kernel, a similarity between the input and a learned center, with no $\sigma$ at all? The nonlinearity, the locality, and the geometry would then come from the kernel, not from a function we bolt on afterward. The kernel I will use is the Yat kernel; the point is the list of things you get for free once the unit is a kernel. This is the catalogue.

It builds directly on two earlier pieces: what an MLP knows when its unit is a kernel, which constructs the unit, and why activations are bad for geometry, which is the problem this construction answers.

First, the kernel you do not mean

There is already a fashionable way a kernel shows up in a neural network, and it is the wrong one for this story. Take a network, send its width to infinity, and in the lazy regime it linearizes around initialization into a fixed kernel, the Neural Tangent Kernel (Jacot et al., 2018). The NTK is real and useful, but look at what it is: emergent (you do not write it down, it falls out of a limit), infinite-dimensional, fixed at initialization, and data-blind. It describes a network that has, in a precise sense (Chizat et al., 2019), stopped learning features: the parameters barely move, and the function is a kernel regressor against a kernel nobody chose.

That is the opposite of what I want. Here the kernel is finite (you write it down in closed form; its core has a feature map of finite dimension), explicit (a design choice, not an asymptotic accident), and learnable (its centers are parameters that move). The network is free to be in the rich, feature-learning regime, where the prototypes migrate toward structure in the data, and it is still a kernel machine at every layer, because the kernel is the unit and not a description of the unit’s limit. “Not the NTK” is the whole framing: do not wait for an infinite-width limit to hand you a kernel you cannot steer. Install one.

The unit

For an input $x\in\mathbb{R}^d$ and a learned center $W_u\in\mathbb{R}^d$, with two scalars $b\ge 0$ and $\varepsilon>0$, hidden unit $u$ computes

$$
y_u(x) \;=\; \alpha_u\,\underbrace{\frac{(W_u^\top x + b)^2}{\lVert x - W_u\rVert^2 + \varepsilon}}_{k_{b,\varepsilon}(W_u,\,x)} .
$$

There is no $\sigma$. The numerator is a squared alignment (how much $x$ points along $W_u$); the denominator is an inverse-multiquadric distance gate (how close $x$ sits to $W_u$). The unit fires when the input is both aligned with and near its center, and its response is a single localized peak at $x = W_u$ (exactly so on normalized inputs; otherwise the maximum sits at $x = (1 + \varepsilon/(\lVert W_u\rVert^2 + b))\,W_u$, an $O(\varepsilon)$ shift along $W_u$). A ReLU unit, by contrast, carries only a direction and fires across an entire half-space; the kernel unit carries a point. That difference, a center in input space versus a hyperplane, is the source of everything below, and it is developed at length in what an MLP knows.

Two properties make this a kernel and not just a clever activation. It is non-negative (the numerator is a square, the denominator is positive). And it is positive definite: $k_{b,\varepsilon}$ is the Schur product of a degree-2 polynomial kernel $(W_u^\top x + b)^2$, which is PSD for $b\ge 0$ and, crucially, finite-rank, with an inverse-multiquadric kernel $1/(\lVert x-W_u\rVert^2 + \varepsilon)$, which is PSD by Schoenberg. A product of PSD kernels is PSD (the Schur product theorem), so $k_{b,\varepsilon}$ is a genuine Mercer kernel; Bouhsine (2026) proves it positive definite for $\varepsilon\ge 0$ and universal for $\varepsilon>0$. Non-negativity and positive definiteness are independent properties, and the unit has both (the useful corner), as the convex-readout companion shows in code. Everything that follows is a consequence of one of those two facts.
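A minimal sketch of the two properties, in plain Python rather than the JAX of the companion. The function names, the random points, and the values $b = 1$, $\varepsilon = 0.1$ are illustrative assumptions, not taken from the post. It builds a Gram matrix $K_{ij} = k_{b,\varepsilon}(x_i, x_j)$ and samples quadratic forms $c^\top K c$, which should never be meaningfully negative if the kernel is positive definite. This is a numerical spot check, not a proof.

```python
import random

def yat_kernel(w, x, b=1.0, eps=0.1):
    """k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq_dist + eps)

def gram(points, b=1.0, eps=0.1):
    """Gram matrix K[i][j] = k(points[i], points[j])."""
    return [[yat_kernel(p, q, b, eps) for q in points] for p in points]

def quadratic_form(K, c):
    """c^T K c, which is >= 0 for every c when K is PSD."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(12)]
K = gram(points)

# Non-negativity: every entry is a square over a positive number.
min_entry = min(min(row) for row in K)

# Positive definiteness, spot-checked on 200 random coefficient vectors.
min_form = min(
    quadratic_form(K, [rng.gauss(0, 1) for _ in points]) for _ in range(200)
)

print("smallest Gram entry:", min_entry)
print("smallest sampled quadratic form:", min_form)
```

Sampling quadratic forms avoids an eigenvalue solver and keeps the check dependency-free; with NumPy or JAX available, the smallest eigenvalue of `K` would be the more direct test.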

Here is the whole thing learning, in your browser. A layer of these units, one prototype each, no activation function, trained by gradient descent on a 2-D task. Watch the prototypes (the kernel centres) migrate onto the data and the decision field form. The parameters move freely: this is the rich, feature-learning regime, the opposite of the NTK’s frozen kernel, and yet every step is an exact kernel machine.

The rest of this post is the catalogue of what that move buys you.

Feature 1, Lazy loading: the layer evaluates only what is near

Because the response is sharply peaked at the center, the units that matter for a given input are the ones whose centers are near it. On unit-norm inputs and centers, $\lVert x - W_u\rVert^2 = 2 - 2S$ and the Yat unit reduces to $k^\circ(S) = (S+b)^2/(\varepsilon + 2 - 2S)$ in the cosine similarity $S=\langle x, W_u\rangle$, a clean bump that is maximal at $S=1$ and falls off as the input rotates away. A ReLU layer cannot do this: a ReLU unit is active on an entire half-space of inputs, so there is no neighborhood to exploit and every unit must be evaluated for every input.
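The reduction to $k^\circ(S)$ can be checked numerically. The sketch below assumes a unit-norm center and unit-norm inputs in two dimensions, with illustrative values $b = 1$ and $\varepsilon = 0.1$ that are not from the post. It sweeps the angle between input and center from $0$ to $\pi$ and compares the full kernel with the reduced form.

```python
import math

def yat_kernel(w, x, b, eps):
    """Full form: (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq_dist + eps)

def yat_on_sphere(S, b, eps):
    """Reduced form (S + b)^2 / (eps + 2 - 2S), valid for unit-norm w and x."""
    return (S + b) ** 2 / (eps + 2.0 - 2.0 * S)

b, eps = 1.0, 0.1                      # illustrative values (assumption)
w = [1.0, 0.0]                         # unit-norm center
angles = [i * math.pi / 8 for i in range(9)]   # 0 .. pi

full, reduced = [], []
for theta in angles:
    x = [math.cos(theta), math.sin(theta)]     # unit-norm input at angle theta from w
    full.append(yat_kernel(w, x, b, eps))
    reduced.append(yat_on_sphere(math.cos(theta), b, eps))

for theta, f, r in zip(angles, full, reduced):
    print(f"theta={theta:5.3f}  full={f:9.4f}  reduced={r:9.4f}")
```

With $b \ge 1$ the response decreases monotonically in the angle. For $b < 1$ it touches zero at $S = -b$ and rises slightly again toward the antipode, so the "bump" has a small secondary lobe there.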

A localized kernel turns dense matrix multiplication into a retrieval problem. The hidden activations are sparse by construction: for any $x$ , only the prototypes in its neighborhood contribute meaningfully, so you can index the centers and fetch the few that are close, nearest-neighbor or approximate-NN over $\{W_u\}$ , instead of touching the whole layer. This is conditional computation that falls out of the geometry of the unit rather than a learned gating network bolted on top, the way a mixture-of-experts router is. It is also why the prototype set is a growable memory: adding a center adds capacity locally, like adding a support vector, and you can load centers on demand rather than holding a fixed dense bank. The activation function gave you none of this; a half-space has no “near.”
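The retrieval idea can be sketched as follows, under several assumptions not in the post: unit-norm inputs and centers, readout weights all equal to one, $b = 0$, $\varepsilon = 0.01$, and a brute-force `heapq.nsmallest` search standing in for a real nearest-neighbor index. The kernel's tails are bounded rather than zero, so how much of the dense response the nearest units capture depends on $\varepsilon$, $b$, and the dimension. The sketch prints that fraction instead of assuming it.

```python
import heapq
import math
import random

def yat_kernel(w, x, b, eps):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq + eps)

def unit(v):
    """Scale a vector to unit norm."""
    n = math.sqrt(sum(vi * vi for vi in v))
    return [vi / n for vi in v]

def sq_dist(a, c):
    return sum((ai - ci) ** 2 for ai, ci in zip(a, c))

def dense_response(x, centers, b, eps):
    """Evaluate every unit: the dense baseline."""
    return sum(yat_kernel(w, x, b, eps) for w in centers)

def lazy_response(x, centers, k, b, eps):
    """Evaluate only the k centers nearest to x.

    The brute-force search is a placeholder for a nearest-neighbor index.
    """
    nearest = heapq.nsmallest(
        k, range(len(centers)), key=lambda u: sq_dist(x, centers[u])
    )
    return sum(yat_kernel(centers[u], x, b, eps) for u in nearest), nearest

rng = random.Random(0)
d, n_units = 16, 200
b, eps = 0.0, 0.01                     # illustrative values (assumption)
centers = [unit([rng.gauss(0, 1) for _ in range(d)]) for _ in range(n_units)]

# A query sitting close to center 7.
x = unit([c + 0.02 * rng.gauss(0, 1) for c in centers[7]])

dense = dense_response(x, centers, b, eps)
lazy, picked = lazy_response(x, centers, k=5, b=b, eps=eps)
print("units evaluated:", len(picked), "of", n_units)
print("fraction of dense response recovered:", lazy / dense)
```

Because every kernel value is non-negative, the lazy sum with unit weights can only undershoot the dense one, and raising `k` closes the gap monotonically.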

Feature 2, Explainability is not bolted on, it is the kernel theory

The reason kernel methods were interpretable long before “interpretability” was a field is that a Mercer kernel comes with a reproducing kernel Hilbert space, and an RKHS hands you four things you otherwise have to reverse-engineer. The same argument I made for attention in attention is explainable because it is a kernel applies verbatim to the MLP unit, because both are now the same object.

A named center. Each unit’s output is a similarity to $W_u$ , a point in input space you can visualize, decode, or compare. “What does this neuron detect?” has a literal answer: whatever lives at $W_u$ . A ReLU neuron’s weight vector is a direction, and a direction does not have a preferred input.
Exact attribution, no surrogate. By the representer theorem (Kimeldorf & Wahba, 1971; Schölkopf et al., 2001), the readout is a weighted sum of kernel evaluations, so the contribution of center $u$ to an output is the kernel weight itself, an exact number, not a LIME/SHAP approximation fit after the fact. When the readout normalizes those weights it becomes a Nadaraya–Watson estimator and the contributions become a convex partition: “unit $u$ accounts for 30% of this output” is then a true statement, not a story.
Capacity you can compute. The squared RKHS norm of a unit’s kernel section $k_{b,\varepsilon}(W_u,\cdot)$ is its value at the center, $k_{b,\varepsilon}(W_u, W_u) = (\lVert W_u\rVert^2 + b)^2/\varepsilon$, a single scalar that measures how sharply the unit is tuned. Intensity, selectivity, and confidence stop being adjectives and become a quantity.
A geometry on inputs. The kernel induces a metric, and the induced metric is the object in which “similar input” is defined. The layer partitions input space into prototype basins you can draw.

None of these are added by an explainability method. They are properties of the RKHS, available the moment the unit is a positive-definite kernel and gone the moment it is a pointwise activation.
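To make the attribution and capacity items concrete, here is a small sketch with three hand-placed centers. The centers, the per-center readout values, the query point, and the scalars are all hypothetical. It normalizes the kernel responses into Nadaraya–Watson shares, which form a convex partition of the output, and evaluates each unit's peak value $k_{b,\varepsilon}(W_u, W_u)$.

```python
def yat_kernel(w, x, b, eps):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq + eps)

b, eps = 1.0, 0.1                                   # illustrative values (assumption)
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # hypothetical prototypes
values = [2.0, -1.0, 0.5]                           # hypothetical per-center readout values
x = [0.9, 0.2]                                      # query close to the first center

# Raw kernel responses, one per center.
responses = [yat_kernel(w, x, b, eps) for w in centers]

# Nadaraya-Watson normalization: non-negative shares that sum to one.
total = sum(responses)
shares = [r / total for r in responses]

# The output is a convex combination of the per-center values,
# so each share is an exact attribution rather than a fitted surrogate.
output = sum(s * v for s, v in zip(shares, values))

# Peak value of each unit, (||W||^2 + b)^2 / eps.
peaks = [yat_kernel(w, w, b, eps) for w in centers]

for u, (s, p) in enumerate(zip(shares, peaks)):
    print(f"unit {u}: share={s:.4f}  k(W, W)={p:.1f}")
print("output:", output)
```

The shares are exact only for this normalized readout. With an unnormalized linear readout the contribution of unit $u$ is still the exact product of its weight and its kernel value, but the contributions no longer sum to one.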

Feature 3, The geometry survives the layer

A pointwise activation damages the geometry of the representation in ways that are now well documented: ReLU zeros coordinates and collapses rank, saturating activations flatten the pullback metric, and in high dimension these are the rule, not the exception; the full argument is in activations are bad for geometry. The damage is structural: a function applied coordinate-wise cannot preserve angles and distances, because it does not know what the coordinates mean together.

The kernel unit does not apply a function to the representation; it measures the representation against centers. Its level sets are distance contours around $W_u$ , and the geometry the layer exposes is the kernel’s induced geometry, informative by construction rather than degenerate by accident. You are not asking a nonlinearity to be selective without warping space, which is the impossible trade the activation is caught in. You are reading space with a ruler.

Feature 4, No activation to choose, and no dead units

The ReLU-versus-GELU-versus-SiLU question disappears, because the nonlinearity is the kernel and the kernel is fixed by the geometry, not the leaderboard. So does the dead-unit pathology: a ReLU unit whose pre-activation is negative over the data has exactly zero gradient and never recovers, a failure mode every large MLP pays for. A Yat unit has no off half-space. Its center always sits somewhere, always has a basin, and receives a nonzero gradient from almost every input (the exceptions, such as the hyperplane $W_u^\top x + b = 0$, have measure zero), pulling it toward the data it should explain. Units do not die; they migrate.
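The dead-unit contrast can be sketched with one unit of each kind whose weight vector points away from all the data. The data points, the weights, and the scalars are illustrative assumptions. The ReLU gradient with respect to its weights is exactly zero on every sample, while the gradient of the kernel value with respect to the center is not. The sketch shows only that a gradient signal exists; its direction under training depends on the loss.

```python
def relu_grad_w(w, c, x):
    """Gradient of max(0, w.x + c) with respect to w."""
    pre = sum(wi * xi for wi, xi in zip(w, x)) + c
    return [xi if pre > 0 else 0.0 for xi in x]

def yat_kernel(w, x, b, eps):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq + eps)

def yat_grad_w(w, x, b, eps):
    """Analytic gradient of k(w, x) with respect to the center w.

    With s = w.x + b and D = ||x - w||^2 + eps:
    dk/dw = 2 s x / D + 2 s^2 (x - w) / D^2.
    """
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    D = sum((xi - wi) ** 2 for wi, xi in zip(w, x)) + eps
    return [2 * s * xi / D + 2 * s * s * (xi - wi) / (D * D)
            for wi, xi in zip(w, x)]

data = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]   # all samples in the positive quadrant
w = [-1.0, -1.0]                              # points away from every sample
b, eps = 1.0, 0.1                             # illustrative values (assumption)

# Sum the per-sample gradients coordinate by coordinate.
relu_total = [sum(g) for g in zip(*(relu_grad_w(w, 0.0, x) for x in data))]
yat_total = [sum(g) for g in zip(*(yat_grad_w(w, x, b, eps) for x in data))]

print("ReLU gradient over the data:", relu_total)
print("Yat  gradient over the data:", yat_total)
```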

Feature 5, A capacity functional, so regularization has a target

Kernel machines come with a complexity measure that generalization bounds actually use: the RKHS norm. For the Yat unit the relevant quantity is the computable $(\lVert W_u\rVert^2 + b)^2/\varepsilon$ above, and bounding it bounds the function class: the standard Rademacher bound for a norm-bounded kernel readout scales as the square root of $(R^2 + b)^2/(\varepsilon n)$, that is $(R^2 + b)/\sqrt{\varepsilon n}$, in the data radius $R$, through a single quantity rather than the product of every layer’s spectral norm. Weight decay on a standard MLP regularizes a proxy (the parameter norm) for a quantity you cannot name (the function’s complexity). Here the quantity has a name, and $\varepsilon$ is a direct dial on it: larger $\varepsilon$ flattens the kernel, lowers the RKHS norm, and widens the receptive field. The kernel form of the unit makes its own regularizer explicit. (The cleanest empirical evidence so far is on the attention side, where the Yat-kernel transformer shows a markedly smaller generalization gap than its softmax baseline; the MLP-side claim is the same mechanism and remains an empirical question.)
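The effect of $\varepsilon$ as a dial can be tabulated directly. This sketch assumes unit-norm inputs and centers (so the reduced form $(S+b)^2/(\varepsilon + 2 - 2S)$ applies) and $b = 1$. The half-width measure is an illustrative choice, not a quantity defined in the post. For each $\varepsilon$ it reports the peak value $k(W_u, W_u)$ and the angle at which the response has dropped to half its peak.

```python
import math

def yat_on_sphere(S, b, eps):
    """Yat response for unit-norm input and center at cosine similarity S."""
    return (S + b) ** 2 / (eps + 2.0 - 2.0 * S)

def half_width(b, eps, iters=60):
    """Angle in radians where the response falls to half its peak.

    Bisection is valid here because, for b >= 1, the response decreases
    monotonically as the angle grows from 0 to pi.
    """
    peak = yat_on_sphere(1.0, b, eps)
    lo, hi = 0.0, math.pi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if yat_on_sphere(math.cos(mid), b, eps) > 0.5 * peak:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b = 1.0                                  # illustrative value (assumption)
rows = []
for eps in (0.01, 0.1, 1.0):
    peak = (1.0 + b) ** 2 / eps          # k(W, W) for a unit-norm center
    width = half_width(b, eps)
    rows.append((eps, peak, width))
    print(f"eps={eps:<5}  peak={peak:8.1f}  half-width={width:.4f} rad")
```

Reading the rows top to bottom, the peak value falls and the half-width grows as $\varepsilon$ increases, which is the trade between selectivity and receptive field described above.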

Feature 6, The “finite” part: a feature map you can write down

This is where the word finite earns its place. The Yat numerator $(W_u^\top x + b)^2$ is a polynomial kernel of degree two, and a degree-two polynomial kernel has an exact, finite-dimensional feature map $\phi_{\text{poly}}(x)$ of size $O(d^2)$ , the monomials $x_i x_j$ , the linears $x_i$ , and a constant. There is no infinite series to truncate, unlike the Gaussian RBF whose feature map is genuinely infinite. The IMQ denominator is not finite-rank, but it is exactly the regime where positive random features and quadrature work well, the same machinery (Rahimi & Recht, 2007; Choromanski et al., 2021) that makes linear-time attention possible, and that I built and verified on the random-feature side in the attention-kernel companion. So the whole unit is linearizable: a map $\phi$ with $k_{b,\varepsilon}(W_u, x) \approx \phi(W_u)^\top \phi(x)$ , exact on the polynomial part and tightly approximable on the IMQ part. A kernel machine you can linearize is a kernel machine you can scale, and “finite” is the property that lets you.
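The exact feature map of the polynomial numerator can be written out and verified in a few lines. The sketch below uses one standard construction (all $d^2$ ordered products, the linears scaled by $\sqrt{2b}$, and the constant $b$); the function names and test vectors are illustrative. It covers only the polynomial part; the random-feature approximation of the IMQ denominator is not attempted here.

```python
import math

def poly_features(x, b):
    """Exact feature map of (w.x + b)^2, for b >= 0.

    d*d ordered products x_i x_j, d linears scaled by sqrt(2b), one constant b,
    so that phi(w).phi(x) = (w.x)^2 + 2b (w.x) + b^2.
    """
    quad = [xi * xj for xi in x for xj in x]
    lin = [math.sqrt(2.0 * b) * xi for xi in x]
    return quad + lin + [b]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

w = [0.5, -1.0, 2.0]          # illustrative center
x = [1.5, 0.25, -0.75]        # illustrative input
b = 0.7

kernel_value = (dot(w, x) + b) ** 2
feature_value = dot(poly_features(w, b), poly_features(x, b))

print("kernel value        :", kernel_value)
print("feature-space value :", feature_value)
print("feature dimension   :", len(poly_features(x, b)))
```

Keeping both orderings $x_i x_j$ and $x_j x_i$ avoids the $\sqrt{2}$ weights on cross terms at the cost of a few redundant coordinates; merging them gives the more compact map with $d(d+1)/2$ quadratic features.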

This is the kernel trick made spatial. In two dimensions the degree-2 feature map is, up to constant scalings of its coordinates, $\phi(x) = [x_1, x_2, x_1^2, x_2^2, x_1x_2, 1]$, and it turns a curved boundary into a flat one: a hyperplane in feature space is a conic in input space. Below, the flat separator is fitted on the engine (a least-squares solve in feature space), and the data is lifted into 3-D by its quadratic score. Raise the lift and the classes rise apart until a single flat plane slices between them; look at the floor and that same flat plane is a circle (rings) or a hyperbola (XOR). The spiral is the honest counter-example: degree two is not enough, so it never fully separates.
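The "hyperplane in feature space is a conic in input space" statement can be illustrated without any fitting. In this sketch the separating weights are hand-picked rather than obtained from a least-squares solve, and the ring radii and XOR points are made up for the example. One flat plane in $\phi$-space is the circle $x_1^2 + x_2^2 = 4$, another is the pair of lines $x_1 x_2 = 0$, the degenerate case of a hyperbola.

```python
import math

def phi(x):
    """Unscaled degree-2 features of a 2-D point."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2, 1.0]

def score(weights, x):
    """Linear function of the features: a hyperplane in feature space."""
    return sum(wi * fi for wi, fi in zip(weights, phi(x)))

# Rings: inner ring of radius 1 (class -1), outer ring of radius 3 (class +1).
angles = [2 * math.pi * i / 12 for i in range(12)]
inner = [[math.cos(a), math.sin(a)] for a in angles]
outer = [[3 * math.cos(a), 3 * math.sin(a)] for a in angles]
ring_plane = [0.0, 0.0, 1.0, 1.0, 0.0, -4.0]    # x1^2 + x2^2 - 4 = 0

# XOR: the label is the sign of x1 * x2.
xor_points = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
xor_labels = [1, 1, -1, -1]
xor_plane = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]      # x1 * x2 = 0

rings_ok = (all(score(ring_plane, p) < 0 for p in inner)
            and all(score(ring_plane, p) > 0 for p in outer))
xor_ok = all(score(xor_plane, p) * y > 0 for p, y in zip(xor_points, xor_labels))

print("rings separated by one plane:", rings_ok)
print("XOR separated by one plane  :", xor_ok)
```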

Feature 7, Out-of-distribution inputs get a smaller answer, not a confident one

Drive a ReLU MLP far off its training distribution and it extrapolates linearly to infinity: it returns a large, confident, usually wrong number, because a half-space has no boundary. A kernel that is peaked at its centers does the opposite. An input far from every prototype evaluates small against all of them: each unnormalized response falls from its peak of order $1/\varepsilon$ to a bounded tail (at most $\lVert W_u\rVert^2$ in the far limit), so the normalized readout has no strong mass to assign, a built-in “I am not near anything I know.” Abstention and OOD-awareness are not a separate detection head; they are what a local kernel does when you leave the data.
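A one-unit sketch of the extrapolation behavior, with illustrative values ($W_u = (1, 0)$, $b = 1$, $\varepsilon = 0.1$) that are not from the post. It pushes an input outward along the unit's own direction, which is the least favorable direction for the kernel because the alignment term keeps growing. The ReLU response grows without bound, while the kernel response drops from its peak and levels off near $\lVert W_u\rVert^2$.

```python
def relu_unit(w, x):
    """max(0, w.x), a bias-free ReLU unit."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)))

def yat_kernel(w, x, b, eps):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq + eps)

w = [1.0, 0.0]                 # illustrative weight vector / center
b, eps = 1.0, 0.1              # illustrative values (assumption)
scales = [1.0, 10.0, 100.0, 1000.0]

# Move the input outward along w: x = t * w.
relu_vals = [relu_unit(w, [t, 0.0]) for t in scales]
yat_vals = [yat_kernel(w, [t, 0.0], b, eps) for t in scales]

for t, r, y in zip(scales, relu_vals, yat_vals):
    print(f"t={t:7.1f}  relu={r:9.1f}  yat={y:9.4f}")
```

The tail is bounded, not zero, so a practical abstention rule would compare the total kernel mass at a query against the mass typically seen on training inputs rather than test for exact zero.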

Feature 8, One primitive for the whole network

Attention is already a kernel smoother, a Nadaraya–Watson estimator over tokens (Nadaraya, 1964; Watson, 1964; and the attention post). If the MLP unit is the same Yat kernel, then the feed-forward block and the attention block are the same primitive applied to different index sets: centers-and-inputs in the MLP, queries-and-keys in attention. One geometry, one attribution story, one capacity functional, end to end, instead of two unrelated mechanisms (a kernel smoother for mixing, an opaque MLP for transformation) that you have to interpret with two different toolkits. The GOAT family of Yat-kernel attention layers is the other half of this picture; together they describe a network whose every layer is readable for the same reason.

What it costs

This is not free, and I want to be as honest as the attention piece was: I am not claiming the Yat MLP is a drop-in win over the transformer FFN at scale. That is an empirical question. Three real costs and caveats:

Distances are not quite a matmul. $\lVert x - W_u\rVert^2 = \lVert x\rVert^2 + \lVert W_u\rVert^2 - 2 W_u^\top x$ reduces to a matmul plus two norm vectors, so the asymptotics match a linear layer, but the constant and the memory traffic are higher than a single GEMM.
Do not stack kernels. The canonical block is $x \to \text{Yat} \to \text{Linear}$ , not $\text{Yat}(\text{Yat}(\cdot))$ ; composing kernel units directly is a known anti-pattern in this line of work. The linear readout after the kernel is doing necessary work.
The scalars and centers need care. $b$ and $\varepsilon$ are admissibility-constrained (parameterize them through a softplus so they stay non-negative), and prototype initialization matters more than weight initialization in a ReLU net, because a center is a location, not just a scale.
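The first and third caveats can be sketched together. This is a plain-Python stand-in for what would be a batched matmul in JAX; the function names, the raw parameter values, and the small batch are hypothetical. The cross-term matrix $X W^\top$ is computed once and reused for both the numerator and the expanded squared distance, and $b$ and $\varepsilon$ are passed through a softplus so they stay positive.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z)); strictly positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def yat_layer(X, W, raw_b, raw_eps):
    """Yat responses for a batch X against centers W.

    Uses ||x - w||^2 = ||x||^2 + ||w||^2 - 2 w.x, so the only
    batch-by-units product is the cross term, shared with the numerator.
    """
    b, eps = softplus(raw_b), softplus(raw_eps)
    x_norms = [sum(v * v for v in x) for x in X]
    w_norms = [sum(v * v for v in w) for w in W]
    cross = [[sum(xi * wi for xi, wi in zip(x, w)) for w in W] for x in X]
    out = []
    for i in range(len(X)):
        row = []
        for u in range(len(W)):
            # Clamp at zero to absorb rounding error in the expansion.
            sq = max(x_norms[i] + w_norms[u] - 2.0 * cross[i][u], 0.0)
            row.append((cross[i][u] + b) ** 2 / (sq + eps))
        out.append(row)
    return out

def direct(x, w, b, eps):
    """Reference value computed from the definition."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    sq = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return (dot + b) ** 2 / (sq + eps)

X = [[0.5, -1.0], [2.0, 0.25]]                  # hypothetical batch
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]       # hypothetical centers
raw_b, raw_eps = 0.3, -2.0                      # unconstrained parameters

H = yat_layer(X, W, raw_b, raw_eps)
b, eps = softplus(raw_b), softplus(raw_eps)
max_err = max(abs(H[i][u] - direct(X[i], W[u], b, eps))
              for i in range(len(X)) for u in range(len(W)))
print("b =", b, " eps =", eps)
print("max deviation from the direct formula:", max_err)
```

The expansion can lose precision when an input sits almost exactly on a center, which is why the sketch clamps the squared distance at zero before adding $\varepsilon$.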

The activation was always the missing kernel

Strip the construction down and the claim is small. A neuron’s job is to answer “how much does this input look like the thing I am tuned for?” The linear-plus-activation neuron answers it badly, a direction and a clamp, with no center, no metric, no honest attribution, and then we spend the rest of the stack, and an entire subfield, trying to recover the geometry the activation threw away. Put a finite, positive-definite kernel where the activation was and you do not add locality, explainability, geometry, and capacity control. They were always what a kernel is; the activation was the thing that hid them. As I put it before: the opacity of the standard MLP was never the price of expressivity. It was the price of giving up the kernel.

The Yat kernel and its universality are from Bouhsine (2026). The kernel-unit construction and the prototype-versus-direction argument are developed in what an MLP knows; the geometric cost of activations in activations are bad for geometry; the RKHS-explainability argument in attention is explainable because it is a kernel. The Neural Tangent Kernel is Jacot et al. (2018) and the lazy regime Chizat et al. (2019); random features Rahimi & Recht (2007) and Choromanski et al. (2021).

References

Mercer, J. (1909). Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society A 209, 415–446.
Kimeldorf, G., Wahba, G. (1971). Some Results on Tchebycheffian Spline Functions. Journal of Mathematical Analysis and Applications 33(1), 82–95.
Schölkopf, B., Herbrich, R., Smola, A. J. (2001). A Generalized Representer Theorem. COLT 2001, 416–426.
Rahimi, A., Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS 2007.
Cho, Y., Saul, L. K. (2009). Kernel Methods for Deep Learning. NeurIPS 2009.
Jacot, A., Gabriel, F., Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS 2018. arXiv:1806.07572
Chizat, L., Oyallon, E., Bach, F. (2019). On Lazy Training in Differentiable Programming. NeurIPS 2019. arXiv:1812.07956
Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability & Its Applications 9(1), 141–142.
Watson, G. S. (1964). Smooth Regression Analysis. Sankhyā: The Indian Journal of Statistics, Series A 26(4), 359–372.
Choromanski, K., et al. (2021). Rethinking Attention with Performers. ICLR 2021. arXiv:2009.14794
Bouhsine, T. (2026). A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance. arXiv:2605.03262
