RANDOM APPROXIMATION OF THE ⵟ-KERNEL

Bernstein–Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Taha Bouhsine

Azetta AI

\[ k_{\text{ⵟ},b}(w,x)=\frac{(w^\top x + b)^2}{\lVert w-x\rVert^2 + \varepsilon},\qquad b\ge 0,\ \varepsilon>0 \]
A squared inner product (alignment) divided by a regularized squared distance (proximity). The kernel is neither shift-invariant nor a dot-product kernel.
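The definition above can be evaluated directly. The following is a minimal NumPy sketch, not code from the paper; the function name `biased_kernel` and the parameter values are hypothetical and chosen only for illustration. It also checks the two structural claims numerically: translating both arguments changes the Gram matrix, and two pairs with the same inner product but different distances give different values.

```python
import numpy as np

def biased_kernel(W, X, b=1.0, eps=1e-3):
    """Gram matrix K[i, j] = (w_i . x_j + b)^2 / (||w_i - x_j||^2 + eps)."""
    W = np.atleast_2d(np.asarray(W, dtype=float))
    X = np.atleast_2d(np.asarray(X, dtype=float))
    G = W @ X.T                                    # inner products w_i . x_j
    sq_dist = (W**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * G
    sq_dist = np.maximum(sq_dist, 0.0)             # guard against tiny negative round-off
    return (G + b) ** 2 / (sq_dist + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = biased_kernel(X, X, b=0.5, eps=0.1)

# Not shift-invariant: translating both arguments changes the alignment numerator.
K_shifted = biased_kernel(X + 1.0, X + 1.0, b=0.5, eps=0.1)

# Not a dot-product kernel: both pairs below have w . x = 1 but different distances.
same_dot_near = biased_kernel([[1.0, 0.0]], [[1.0, 0.0]], b=0.5, eps=0.1)[0, 0]
same_dot_far = biased_kernel([[2.0, 0.0]], [[0.5, 0.0]], b=0.5, eps=0.1)[0, 0]
```

Materializing this matrix costs \(O(N^2)\) time and memory, which is the cost the random-feature construction is meant to avoid.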

Abstract

Bernstein–Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor by sampling its one-dimensional Bernstein–Widder scale and applying Gaussian random Fourier features. The feature dimension becomes \(Dm\), free of the \(O(d^2)\) size of the exact-modulation feature. Keeping the modulation exact is the analyzable limit (\(m\to\infty\)): there we prove unbiasedness, an exact variance for the flat estimator, an intrinsic-dimension matrix-Bernstein operator-norm bound, and a deterministic kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator (RAY) inherits the same operator-norm guarantee plus a single additive sketch term. The motivating instance is the biased ⵟ-kernel above, whose family span contains the inverse-multiquadric kernel; experiments validate the construction off-sphere, isolate the alignment×proximity coupling, and turn RAY into a linear-time, streaming attention primitive that the exact kernel cannot scale to and a landmark method cannot stream.
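For the motivating kernel, both randomizations can be written out explicitly. The radial factor satisfies \(1/(r^2+\varepsilon)=\int_0^\infty e^{-s r^2}\,e^{-\varepsilon s}\,ds\), so the scale can be drawn as \(s\sim\mathrm{Exp}(\text{rate}=\varepsilon)\) with overall mass \(1/\varepsilon\), followed by Gaussian random Fourier features with \(\omega\mid s\sim\mathcal N(0,2sI)\). The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. The Rademacher product sketch for \((w^\top x+b)^2\), the random-phase cosine features, the row-wise Kronecker pairing, and every function name are choices made for this example only.

```python
import numpy as np

def exact_kernel(W, X, b, eps):
    G = W @ X.T
    sq = (W**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * G
    return (G + b) ** 2 / (np.maximum(sq, 0.0) + eps)

def make_feature_maps(d, D, m, b, eps, rng):
    """Draw the randomness once; return (radial, modulation) feature maps."""
    # Radial factor: sample the Bernstein-Widder scale, then Gaussian frequencies.
    s = rng.exponential(scale=1.0 / eps, size=D)             # s ~ Exp(rate=eps)
    omega = rng.normal(size=(D, d)) * np.sqrt(2.0 * s)[:, None]  # omega | s ~ N(0, 2sI)
    beta = rng.uniform(0.0, 2.0 * np.pi, size=D)             # random phases
    # Modulation: assumed Rademacher product sketch of (u_w . u_x)^2, u = [x; sqrt(b)].
    R1 = rng.choice([-1.0, 1.0], size=(m, d + 1))
    R2 = rng.choice([-1.0, 1.0], size=(m, d + 1))

    def radial(X):
        # E[phi(w) . phi(x)] = 1 / (||w - x||^2 + eps)
        return np.sqrt(2.0 / (eps * D)) * np.cos(X @ omega.T + beta)

    def modulation(X):
        # E[psi(w) . psi(x)] = (w . x + b)^2
        U = np.hstack([X, np.full((X.shape[0], 1), np.sqrt(b))])
        return (U @ R1.T) * (U @ R2.T) / np.sqrt(m)

    return radial, modulation

def kron_rows(Phi, Psi):
    """Row-wise Kronecker product: one joint feature of dimension D * m per point."""
    return np.einsum("nd,nm->ndm", Phi, Psi).reshape(Phi.shape[0], -1)

rng = np.random.default_rng(0)
d, b, eps = 4, 1.0, 1.0
X = 0.5 * rng.normal(size=(30, d))                            # points off the unit sphere
radial, modulation = make_feature_maps(d, D=4096, m=4096, b=b, eps=eps, rng=rng)
Phi, Psi = radial(X), modulation(X)

# Inner products of the D*m joint features equal this Hadamard product,
# so the Gram estimate never needs the joint features materialized.
K_hat = (Phi @ Phi.T) * (Psi @ Psi.T)
K = exact_kernel(X, X, b, eps)
rel_err = np.linalg.norm(K_hat - K) / np.linalg.norm(K)
print(f"relative Frobenius error: {rel_err:.3f}")
```

Because the two sets of draws are independent, the product of the two unbiased factor estimates is itself unbiased for the full kernel. For downstream linear models, `kron_rows(Phi, Psi)` gives the explicit joint feature of dimension \(Dm\).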

Why it matters

\(O(NM)\)
linear-time attention vs the \(O(N^2)\) kernel-smoother wall: 137 GB exact at \(N{=}131{,}072\), 1.6 GB for RAY
\(O(d^2)\) → none
sketching removes the \(O(d^2)\) polynomial floor; the feature dimension drops to \(Dm\)
class-level
one estimator, one analysis, for every finite-modulation × completely-monotone-radial kernel
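The 137 GB figure is consistent with storing a dense \(N\times N\) matrix in 64-bit floats, which the following back-of-envelope check reproduces. The dtype is an assumption, since the page does not state it. The implied feature count for the 1.6 GB figure is likewise only an inference, assuming the footprint is a single \(N\times M\) float64 feature matrix; the source does not give \(M\).

```python
N = 131_072                     # context length quoted above (2**17)
BYTES_PER_FLOAT64 = 8           # assumed dtype, not stated in the source

# Dense N x N kernel-smoother matrix.
exact_bytes = N * N * BYTES_PER_FLOAT64
exact_gb = exact_bytes / 1e9
print(f"exact: {exact_gb:.1f} GB")          # exact: 137.4 GB

# Feature count that an N x M float64 matrix would need to occupy 1.6 GB.
implied_M = 1.6e9 / (N * BYTES_PER_FLOAT64)
print(f"implied M: {implied_M:.0f}")
```

Under these assumptions the memory ratio is simply \(N/M\), so the gap widens linearly as the context grows at a fixed feature budget.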

Contributions

Results

Off-sphere Gram error and Nyström degradation
Off-sphere, the key regime. RAY follows the \(O(1/\sqrt D)\) Monte-Carlo rate at every dimension and stays bounded as \(d\) grows, while uniform and k-means Nyström degrade at fixed landmark count.
Linear-time streaming attention
Linear-time streaming attention. Faithful to exact ⵟ-attention, stable across context length, clearing the \(O(N^2)\) memory wall with a constant-size decode state and an exact causal recurrence.
Operator-norm error of doubly-randomized RAY
Doubly-randomized operator norm. The deployed sketched estimator matches the radial \(O(D^{-1/2})\) term plus the conditioned sketch term, as predicted.
Kernel ridge regression downstream
Downstream KRR. RAY tracks the exact ⵟ-kernel as the draw budget grows, with an IMQ-RFF ablation isolating the exact alignment numerator.
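The streaming-attention result relies on a standard property of feature-map attention: once the kernel is written as an inner product of \(M\)-dimensional features, the causal kernel smoother can be computed by a recurrence whose state has fixed size, independent of context length. The sketch below illustrates that recurrence generically and is not taken from the paper's code. The function names are hypothetical, and the positive placeholder features are an assumption that keeps the normalizer away from zero; signed random features can make the normalizer small or negative, and how the paper handles that is not stated here.

```python
import numpy as np

def causal_smoother_quadratic(Phi_q, Phi_k, V):
    """O(N^2) reference: causally masked kernel smoother in feature space."""
    A = np.tril(Phi_q @ Phi_k.T)                 # position t attends to keys 0..t
    return (A @ V) / A.sum(axis=1, keepdims=True)

def causal_smoother_streaming(Phi_q, Phi_k, V):
    """O(N M) recurrence with a constant-size decode state (S, z)."""
    N, M = Phi_q.shape
    S = np.zeros((M, V.shape[1]))                # running sum of phi(k_t) v_t^T
    z = np.zeros(M)                              # running sum of phi(k_t)
    out = np.empty_like(V, dtype=float)
    for t in range(N):
        S += np.outer(Phi_k[t], V[t])            # absorb the new key/value pair
        z += Phi_k[t]
        out[t] = (Phi_q[t] @ S) / (Phi_q[t] @ z) # normalized read-out for query t
    return out

rng = np.random.default_rng(0)
N, M, dv = 64, 16, 8
Phi_q = rng.uniform(0.1, 1.0, size=(N, M))       # placeholder positive features
Phi_k = rng.uniform(0.1, 1.0, size=(N, M))
V = rng.normal(size=(N, dv))

out_quadratic = causal_smoother_quadratic(Phi_q, Phi_k, V)
out_streaming = causal_smoother_streaming(Phi_q, Phi_k, V)
max_gap = np.abs(out_quadratic - out_streaming).max()
```

The two outputs agree to floating-point precision, which is the sense in which the recurrence is exact: it reproduces the feature-space attention, while any gap to the true kernel comes only from the random-feature approximation.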

Citation

@article{bouhsine2026bernsteinschur,
  title   = {Bernstein--Schur Kernels: Random Features by
             Sketched Modulation and Radial Randomization},
  author  = {Bouhsine, Taha},
  year    = {2026},
  note    = {Code: https://github.com/mlnomadpy/ray}
}