RAY — Random Approximation of the ⵟ-kernel

\[ k_{\text{ⵟ},b}(w,x)=\frac{(w^\top x + b)^2}{\lVert w-x\rVert^2 + \varepsilon},\qquad b\ge 0,\ \varepsilon>0 \]

A squared inner product (alignment) times an inverse squared distance (proximity); the result is neither shift-invariant nor a dot-product kernel.
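A minimal sketch of the kernel in the display above, evaluated directly. The function name `biased_kernel` and the test points are illustrative choices, not taken from the RAY code. Translating both inputs changes the value, and two pairs with the same inner product but different distances disagree, so neither the shift-invariant nor the dot-product template fits.

```python
import numpy as np

def biased_kernel(w, x, b=1.0, eps=1.0):
    """Evaluate (w.x + b)^2 / (||w - x||^2 + eps) for two vectors."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    alignment = (w @ x + b) ** 2              # squared biased inner product
    proximity = np.sum((w - x) ** 2) + eps    # squared distance plus epsilon
    return alignment / proximity

# Base pair: inner product 0, squared distance 2.
k_base = biased_kernel([1.0, 0.0], [0.0, 1.0])

# Same pair translated by (1, 1): the distance is unchanged but the value is not,
# so the kernel is not shift-invariant.
k_shifted = biased_kernel([2.0, 1.0], [1.0, 2.0])

# Same inner product (0) but a larger distance: the value changes,
# so the kernel is not a function of w.x alone.
k_scaled = biased_kernel([2.0, 0.0], [0.0, 2.0])

print(k_base, k_shifted, k_scaled)
```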

Abstract

Bernstein–Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor by sampling its one-dimensional Bernstein–Widder scale and applying Gaussian random Fourier features. The feature dimension becomes \(Dm\), free of the \(O(d^2)\) size of the exact-modulation feature. Keeping the modulation exact is the analyzable limit (\(m\to\infty\)): there we prove unbiasedness, an exact variance for the flat estimator, an intrinsic-dimension matrix-Bernstein operator-norm bound, and a deterministic kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator (RAY) inherits the same operator-norm guarantee plus a single additive sketch term. The motivating instance is the biased ⵟ-kernel above, whose family span contains the inverse-multiquadric kernel; experiments validate the construction off-sphere, isolate the alignment×proximity coupling, and turn RAY into a linear-time, streaming attention primitive that the exact kernel cannot scale to and a landmark method cannot stream.
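For the motivating instance, the exact-modulation limit can be illustrated with three standard identities: \(\frac{1}{r^2+\varepsilon}=\frac{1}{\varepsilon}\,\mathbb{E}_{s\sim\mathrm{Exp}(\varepsilon)}\big[e^{-s r^2}\big]\) (the Bernstein–Widder scale mixture), \(e^{-s\lVert w-x\rVert^2}=\mathbb{E}_{\omega\sim\mathcal N(0,2sI)}\big[\cos(\omega^\top(w-x))\big]\) (Gaussian random Fourier features), and \((w^\top x+b)^2=\langle\psi(w),\psi(x)\rangle\) with \(\psi(x)=(\mathrm{vec}(xx^\top),\sqrt{2b}\,x,\,b)\). The NumPy sketch below combines them, drawing one frequency per sampled scale. All function names, the test points, and the budget `D` are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def modulation_features(X, b):
    """Exact finite map psi with <psi(w), psi(x)> = (w.x + b)^2; d^2 + d + 1 columns."""
    n, d = X.shape
    quad = np.einsum("ni,nj->nij", X, X).reshape(n, d * d)   # vec(x x^T)
    return np.hstack([quad, np.sqrt(2.0 * b) * X, np.full((n, 1), b)])

def sample_frequencies(d, D, eps, rng):
    """One frequency per scale: s ~ Exp(rate=eps), then omega | s ~ N(0, 2 s I)."""
    s = rng.exponential(scale=1.0 / eps, size=D)
    return rng.standard_normal((D, d)) * np.sqrt(2.0 * s)[:, None]

def radial_features(X, omega, eps):
    """cos/sin pairs whose inner product estimates 1 / (||w - x||^2 + eps)."""
    proj = X @ omega.T
    D = omega.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(eps * D)

def exact_modulation_features(X, omega, b, eps):
    """Row-wise tensor product, so the feature inner product is the product of both factors."""
    psi = modulation_features(X, b)
    z = radial_features(X, omega, eps)
    return np.einsum("np,nq->npq", psi, z).reshape(X.shape[0], -1)

def exact_gram(X, b, eps):
    """Reference Gram matrix of the biased kernel."""
    inner = X @ X.T
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * inner
    return (inner + b) ** 2 / (dist2 + eps)

rng = np.random.default_rng(0)
X = np.array([[0.6, 0.0, 0.2], [0.1, -0.7, 0.3], [-0.4, 0.5, 0.5]])  # off-sphere points
b, eps, D = 1.0, 1.0, 20_000

omega = sample_frequencies(X.shape[1], D, eps, rng)
Phi = exact_modulation_features(X, omega, b, eps)
max_err = np.max(np.abs(Phi @ Phi.T - exact_gram(X, b, eps)))
print(Phi.shape, max_err)
```

The feature width here is \((d^2+d+1)\cdot 2D\), which makes the \(O(d^2)\) cost of keeping the modulation exact visible in the shape of `Phi`.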

Why it matters

\(O(NM)\)

linear-time attention vs the \(O(N^2)\) kernel-smoother wall — 137 GB exact at \(N{=}131{,}072\), 1.6 GB for RAY

\(O(d^2)\) → none

sketching removes the \(O(d^2)\) polynomial floor; the feature dimension drops to \(Dm\)

class-level

one estimator, one analysis, for every finite-modulation × completely-monotone-radial kernel
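The 137 GB figure quoted for exact attention can be checked with a few lines of arithmetic, assuming a dense \(N\times N\) matrix of 8-byte floats (the precision is an assumption, chosen because it reproduces the quoted number). The feature count `M` below is a hypothetical placeholder, used only to show how the two costs scale with \(N\); it is not the setting behind the 1.6 GB figure.

```python
def dense_matrix_gb(rows, cols, bytes_per_entry=8):
    """Memory of a dense rows x cols matrix in gigabytes (1 GB = 1e9 bytes)."""
    return rows * cols * bytes_per_entry / 1e9

N = 131_072
exact_gb = dense_matrix_gb(N, N)          # full N x N kernel matrix

M = 1_024                                 # hypothetical feature count, for scaling only
feature_gb = dense_matrix_gb(N, M)        # N x M feature matrix

# Doubling N quadruples the exact cost but only doubles the feature cost.
exact_growth = dense_matrix_gb(2 * N, 2 * N) / exact_gb
feature_growth = dense_matrix_gb(2 * N, M) / feature_gb

print(round(exact_gb, 1), exact_growth, feature_growth)
```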

Contributions

i
A construction for a kernel class, not one kernel
Identify Bernstein–Schur kernels and linearize the whole family with one unbiased estimator: keep the finite modulation exact, sample the radial Bernstein–Widder scale, apply random Fourier features.
ii
Sharp variance and optimal allocation
An exact variance for the flat estimator, a proof that one frequency per scale is variance-optimal at fixed budget, and a norm- and bias-free normalized variant.
iii
Operator-norm concentration & KRR stability
An intrinsic-dimension matrix-Bernstein bound governed by the top eigenvalues — not the crude \(N\max_{ij}\) route — and a deterministic relative-spectral kernel-ridge perturbation bound.
iv
RAY: the deployed, doubly-randomized estimator
Sketching the modulation drops the feature dimension to \(Dm\). Conditioning on the sketch carries the operator-norm guarantee over, plus a single tunable additive term.
v
Empirical validation
The \(O(1/\sqrt D)\) rate, the \((R^2+b)^4\) bias law, the off-sphere niche where Nyström degrades with \(d\), a controlled coupled-target preference test, and a linear-time streaming attention primitive.
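The doubly-randomized estimator in contribution iv can be illustrated as follows. The source does not say which sketch is used for the modulation, so this example assumes a simple product-of-projections sketch with Rademacher vectors on the augmented input \((x,\sqrt b)\); it is unbiased for \((w^\top x+b)^2\) but is a stand-in, not necessarily RAY's sketch. All names and budgets are illustrative. The radial factor uses random-phase cosine features so that the combined width is exactly \(Dm\).

```python
import numpy as np

def sketch_modulation(X, G, H, b):
    """m-column sketch with E<z(w), z(x)> = (w.x + b)^2 (assumed sketch, see lead-in)."""
    Xa = np.hstack([X, np.full((X.shape[0], 1), np.sqrt(b))])  # xa.wa = x.w + b
    return (Xa @ G.T) * (Xa @ H.T) / np.sqrt(G.shape[0])

def radial_features(X, omega, phase, eps):
    """D random-phase cosine features estimating 1 / (||w - x||^2 + eps)."""
    D = omega.shape[0]
    return np.sqrt(2.0 / (eps * D)) * np.cos(X @ omega.T + phase)

def draw(d, m, D, eps, rng):
    """Sample both sources of randomness: sketch vectors and scale-mixed frequencies."""
    G = rng.choice([-1.0, 1.0], size=(m, d + 1))
    H = rng.choice([-1.0, 1.0], size=(m, d + 1))
    s = rng.exponential(scale=1.0 / eps, size=D)               # s ~ Exp(rate=eps)
    omega = rng.standard_normal((D, d)) * np.sqrt(2.0 * s)[:, None]
    phase = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return G, H, omega, phase

def exact_gram(X, b, eps):
    inner = X @ X.T
    sq = np.sum(X ** 2, axis=1)
    return (inner + b) ** 2 / (sq[:, None] + sq[None, :] - 2.0 * inner + eps)

rng = np.random.default_rng(0)
X = np.array([[0.6, 0.0, 0.2], [0.1, -0.7, 0.3], [-0.4, 0.5, 0.5]])
b, eps = 1.0, 1.0

# Small budget: materialize the D*m features explicitly.
m, D = 64, 256
G, H, omega, phase = draw(X.shape[1], m, D, eps, rng)
zm = sketch_modulation(X, G, H, b)
zr = radial_features(X, omega, phase, eps)
Phi = np.einsum("np,nq->npq", zm, zr).reshape(X.shape[0], -1)
# The Gram of the tensor features is the entrywise (Schur) product of the two Grams.
schur_gap = np.max(np.abs(Phi @ Phi.T - (zm @ zm.T) * (zr @ zr.T)))

# Large budget: use the Schur-product form to check accuracy without the wide matrix.
G, H, omega, phase = draw(X.shape[1], 200_000, 100_000, eps, rng)
zm_big = sketch_modulation(X, G, H, b)
zr_big = radial_features(X, omega, phase, eps)
max_err = np.max(np.abs((zm_big @ zm_big.T) * (zr_big @ zr_big.T) - exact_gram(X, b, eps)))
print(Phi.shape, schur_gap, max_err)
```

Conditioning on the sketch draw, the radial part behaves as in the exact-modulation case with the sketched modulation in place of the exact one, which mirrors the conditioning argument described in contribution iv.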

Results

Figure: Off-sphere Gram error and Nyström degradation. **Off-sphere, the key regime.** RAY follows the \(O(1/\sqrt D)\) Monte-Carlo rate at every dimension and stays bounded as \(d\) grows, while uniform and k-means Nyström degrade at fixed landmark count.

**Linear-time streaming attention.** Faithful to exact ⵟ-attention, stable across context length, clearing the \(O(N^2)\) memory wall with a constant-size decode state and an exact causal recurrence.
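The constant-size decode state and exact causal recurrence can be illustrated with the generic linear-attention recurrence, assuming attention is normalized as a kernel smoother over feature inner products. The feature matrices below are positive random placeholders rather than RAY features, and the function names are illustrative. The sketch checks that the streaming update reproduces the masked quadratic computation for the same features.

```python
import numpy as np

def causal_attention_quadratic(Fq, Fk, V):
    """Reference: build the masked N x N score matrix, O(N^2) memory."""
    scores = np.tril(Fq @ Fk.T)                        # position t sees keys 0..t
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def causal_attention_streaming(Fq, Fk, V):
    """Same output from a running state of size M x dv plus M, independent of N."""
    M, dv = Fk.shape[1], V.shape[1]
    S = np.zeros((M, dv))                              # running sum of outer(phi(k_t), v_t)
    z = np.zeros(M)                                    # running sum of phi(k_t)
    out = np.empty((V.shape[0], dv))
    for t in range(V.shape[0]):
        S += np.outer(Fk[t], V[t])
        z += Fk[t]
        out[t] = (Fq[t] @ S) / (Fq[t] @ z)             # kernel-smoother normalization
    return out

rng = np.random.default_rng(0)
N, M, dv = 32, 16, 4
# Placeholder positive features keep the normalizer away from zero in this demo.
Fq = rng.uniform(0.1, 1.0, size=(N, M))
Fk = rng.uniform(0.1, 1.0, size=(N, M))
V = rng.standard_normal((N, dv))

out_quad = causal_attention_quadratic(Fq, Fk, V)
out_stream = causal_attention_streaming(Fq, Fk, V)
print(np.max(np.abs(out_quad - out_stream)))
```

With signed random features the normalizer can approach zero, so a practical implementation would need to handle that case; the demo sidesteps it by construction.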

Figure: Operator-norm error of doubly-randomized RAY. **Doubly-randomized operator norm.** The deployed sketched estimator matches the radial \(O(D^{-1/2})\) term plus the conditioned sketch term, as predicted.

Figure: Kernel ridge regression downstream. **Downstream KRR.** RAY tracks the exact ⵟ-kernel as the draw budget grows, with an IMQ-RFF ablation isolating the exact alignment numerator.
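As a generic illustration of how an explicit feature matrix plugs into kernel ridge regression: ridge regression solved in feature space gives the same fitted values as kernel ridge regression on the Gram matrix \(\Phi\Phi^\top\), while only requiring an \(M\times M\) solve. The Gaussian matrix `Phi`, the targets, and the regularizer `lam` below are placeholders for illustration, not the experimental setup.

```python
import numpy as np

def ridge_primal(Phi, y, lam):
    """Solve (Phi^T Phi + lam I) theta = Phi^T y; an M x M system."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def ridge_dual(K, y, lam):
    """Solve (K + lam I) alpha = y; an N x N system."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.default_rng(0)
N, M, lam = 50, 8, 0.1
Phi = rng.standard_normal((N, M))      # placeholder for a random-feature matrix
y = rng.standard_normal(N)

K = Phi @ Phi.T                        # Gram matrix induced by the features
pred_primal = Phi @ ridge_primal(Phi, y, lam)
pred_dual = K @ ridge_dual(K, y, lam)
print(np.max(np.abs(pred_primal - pred_dual)))
```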

Citation

@article{bouhsine2026bernsteinschur,
  title   = {Bernstein--Schur Kernels: Random Features by
             Sketched Modulation and Radial Randomization},
  author  = {Bouhsine, Taha},
  year    = {2026},
  note    = {Code: https://github.com/mlnomadpy/ray}
}
