0% found this document useful (0 votes)

11 views35 pages

Liu Et Al. - 2021 - Random Features For Kernel Approximation A Survey On Algorithms, Theory, and Beyond

This survey provides a comprehensive overview of random features for kernel approximation, highlighting their algorithms, theoretical results, and practical applications over the past decade. It discusses the motivations behind random features, their sampling schemes, and evaluates their performance on large-scale datasets. Additionally, the survey addresses the relationship between random features and deep neural networks, aiming to guide practitioners and stimulate future research in the field.

Uploaded by

Li Luo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views35 pages

Liu Et Al. - 2021 - Random Features For Kernel Approximation A Survey On Algorithms, Theory, and Beyond

Uploaded by

Li Luo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 35

1

Random Features for Kernel Approximation: A

Survey on Algorithms, Theory, and Beyond
Fanghui Liu, Xiaolin Huang, Yudong Chen, Johan A.K. Suykens

Abstract—The class of random features is one of the most popular techniques to speed up kernel methods in large-scale problems.
Related works have been recognized by the NeurIPS Test-of-Time award in 2017 and the ICML Best Paper Finalist in 2019. The body of
work on random features has grown rapidly, and hence it is desirable to have a comprehensive overview on this topic explaining the
connections among various algorithms and theoretical results. In this survey, we systematically review the work on random features
from the past ten years. First, the motivations, characteristics and contributions of representative random features based algorithms are
summarized according to their sampling schemes, learning procedures, variance reduction properties and how they exploit training data.
arXiv:2004.11154v5 [stat.ML] 11 Jul 2021

Second, we review theoretical results that center around the following key question: how many random features are needed to ensure a
high approximation quality or no loss in the empirical/expected risks of the learned estimator. Third, we provide a comprehensive evaluation
of popular random features based algorithms on several large-scale benchmark datasets and discuss their approximation quality and
prediction performance for classification. Last, we discuss the relationship between random features and modern over-parameterized
deep neural networks (DNNs), including the use of high dimensional random features in the analysis of DNNs as well as the gaps between
current theoretical and empirical results. This survey may serve as a gentle introduction to this topic, and as a users’ guide for practitioners
interested in applying the representative algorithms and understanding theoretical results under various technical assumptions. We hope
that this survey will facilitate discussion on the open problems in this topic, and more importantly, shed light on future research directions.

Index Terms—random features, kernel approximation, generalization properties, over-parameterized models

F
1 I NTRODUCTION

K ERNEL methods [1], [2], [3] are one of the most powerful
techniques for nonlinear statistical learning problems with
a wide range of successful applications. Let x, x0 ∈ X ⊆ Rd
methods provide a data dependent vector representation of the
kernel. Random Fourier features (RFF) [9], on the other hand, is
a typical data-independent technique to approximate the kernel
be two samples and φ : X → H be a nonlinear feature map function using an explicit feature mapping. This survey focuses
transforming each element in X into a reproducing kernel Hilbert on RFF and its variants for kernel approximation. RFF applies in
space (RKHS) H, in which the inner product between φ(x) and particular to shift-invariant (also called “stationary”) kernels that
φ(x0 ) endowed by H can be computed using a kernel function satisfy k(x, x0 ) = k(x − x0 ). By virtue of the correspondence
k(·, ·) : Rd × Rd → R as hφ(x), φ(x0 )iH = k(x, x0 ). In practice, between a shift-invariant kernel and its Fourier spectral density, the
the kernel function k is directly given to obtain the inner product kernel can be approximated by k(x, x0 ) ≈ hϕ(x), ϕ(x0 )i, where
hφ(x), φ(x0 )iH instead of finding the explicit expression of φ, the explicit mapping ϕ : Rd → Rs is obtained by sampling from
which is known as the kernel trick. Benefiting from this scheme, a distribution defined by the inverse Fourier transform of k . To
kernel methods are effective for learning nonlinear structures but scale kernel methods in the large sample case (e.g., n d), the
often suffer from scalability issues in large-scale problems due to number of random features s is often taken to be larger than the
high space and time complexities. For instance, given n samples in original sample dimension d but much smaller than the sample size
the original d-dimensional space X , kernel ridge regression (KRR) n to achieve computational efficiency in practice.1 Accordingly, the
requires O(n3 ) training time and O(n2 ) space to store the kernel random features model is a powerful tool for scaling up traditional
matrix, which is often computationally infeasible when n is large. kernel methods [10], [11], neural tangent kernel [12], [13], [14],
To overcome the poor scalability of kernel methods, kernel graph neural networks [15], [16], and attention in Transformers
approximation is an effective technique by constructing an explicit [17], [18]. Interestingly, the random features model can be viewed
mapping Ψ : Rd → Rs such that k(x, y) ≈ Ψ(x)> Ψ(y). By as a class of two-layer neural networks with fixed weights in the
doing so, an efficient linear model can be well learned in the first layer. This connection has important theoretical implications.
transformed space with O(ns2 ) time and O(ns) memory while It has been observed that deep neural networks (DNNs) exhibit
retaining the expressive power of nonlinear methods. A series of certain intriguing phenomena such as the ability to fit random labels
kernel approximation algorithms have been developed in the past [19] and double descent [20] in the over-parameterized regime.
years, including divide-and-conquer approaches [4], [5], [6], greedy Theoretical results [13], [21], [22], [23] for random features can be
basis selection techniques [7] and Nyström methods [8]. These leveraged to explain these phenomena and provide an analysis of
two-layer over-parameterized neural networks. Partly due to its far-
F. Liu and J.A.K. Suykens are with the Department of Electrical
Engineering (ESAT-STADIUS), KU Leuven, B-3001 Leuven, Belgium (email: reaching repercussions, the seminal work by Rahimi and Recht on
{fanghui.liu;johan.suykens}@esat.kuleuven.be). RFF [9] won the Test-of-Time Award in the Thirty-first Advances
X. Huang is with Institute of Image Processing and Pattern Recognition, and also in Neural Information Processing Systems (NeurIPS 2017).
with Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai
200240, P.R. China (e-mail: [email protected]).
Y. Chen is with School of Operations Research and Information Engineering, 1. Random features model can be regarded as an over-parameterized model
Cornell University, Ithaca, NY 14850 USA (e-mail: [email protected]). allowing for s n, refer to Section 7 for details.
2

RFF spawns a new direction for kernel approximation, and Table 1

the past ten years has witnessed a flurry of research papers Commonly used parameters and symbols.
devoted to this topic. On the algorithmic side, subsequent work
has focused on improving the kernel approximation quality [24], Notation Definition Notation Definition
[25] and decreasing the time and space complexities [26], [27]. n number of samples d feature dimension
Implementation of RFF has in fact been taken to the hardware level s number of random features λ regularization parameter
[28], [29]. On the theoretical side, a series of works aim to address k (original) kernel function k̃ (approximated) kernel function
the following two key questions: ωi random feature βλ optimization variable
x data point y label vector
1) Approximation: how many random features are needed to ς Gaussian kernel width σ activation function
ensure high quality of kernel approximation? ei standard basis vector u u := hx, x0 i/(kxkkx0 k)
2) Generalization: how many random features are needed to K (original) kernel matrix K
f (approximated) kernel matrix
incur no loss in the expected risk of a learned estimator? τ τ := x − x0 τ τ := kτ k2
Z random feature matrix W transformation matrix
Here “no loss” means how large s should be for the (approximated) fρ target function ` loss function
kernel estimator with s random features to be almost as good as fz,λ (original) empirical functional f˜z,λ (approximated) functional
the exact one. Much research effort has been devoted to this Ez empirical risk E expected risk
direction, including analyzing the kernel approximation error lλ (ω) ridge leverage function dλK effective dimension (matrix)
Σ integral operator N (λ) effective dimension (operator)
(the first question above) [9], [30], and studying the risk and ⊗ tensor product . ≤ with a constant C times
generalization properties (the second question above) [11], [31]. α convergence rate for λ γ rate for effective dimension
Increasingly refined and general results have been obtained over
the years. In the Thirty-sixth International Conference on Machine
Learning (ICML 2019), Li et al. [31] were recognized by the Section 6. In Section 7, we discuss recent results on random
Honorable Mentions (best paper finalist) for their unified theoretical features in over-parameterized regimes. The paper is concluded in
analysis of RFF. Section 8 with a discussion on future directions.
RFF has proved effective in a broad range of machine learning
tasks. Given its remarkable empirical success and the rapid growth 2 P RELIMINARIES AND TAXONOMIES
of the related literature, we believe it is desirable to have a In this section, we introduce preliminaries on the problem setting
comprehensive overview on this topic summarizing the progress and theoretical foundation of random features. We then present a
in algorithm design and applications, and elucidating existing taxonomy of existing random features based algorithms, which sets
theoretical results and their underlying assumptions. With this goal the stage for the subsequent discussion. A set of commonly used
in mind, in this survey we systematically review the work from parameters is summarized in Table 1.
the past ten years on the algorithms, theory and applications of
random features methods. Figure 1 shows a schematic overview 2.1 Problem Settings
of the history of the work on random features in recent years. The Consider the following standard supervised learning setup. Let
main contributions of this survey include: X ⊂ Rd be a compact metric space of samples, and Y = {−1, 1}
1) We provide an overview of a wide range of random (in classification) or Y ⊆ R (in regression) be the label space.
features based algorithms, re-organize the formulation of We assume that a sample set {zi = (xi , yi )}n i=1 is drawn from
representative approaches under a unifying framework for a non-degenerate unknown Borel probability measure ρ on X ×
a direct understanding and comparison. Y . Let H be a RKHS endowed with a positive definite kernel
2) We summarize existing theoretical results on the kernel function k(·, ·), and K = [k(xi , xj )]n i,j=1 be the kernel matrix
approximation error measured in various metrics, as well associatedRwith the samples. The target function of ρ is defined as
as results on generalization risk of kernel estimators. The fρ (x) = Y ydρ(y|x) for x ∈ X , where ρ(·|x) is the conditional
underlying assumptions in these results are discussed in detail. distribution of y given x. The typical empirical risk minimization
In particular, we (partly) answer an open question in this topic: problem is considered as
why good kernel approximation performance cannot lead to
( n )
1X 2
good generalization performance? fz,λ := argmin ` yi , f (xi ) + λkf kH , (1)
f ∈H n i=1
3) We systematically evaluate and compare the empirical per-
formance of representative random features based algorithms where ` : Y × Y → R is a loss function and λ ≡ λ(n) > 0 is a
under different experimental settings. regularization parameter. In learning theory, one typically assumes
4) We discuss recent research trends on (high dimensional) that limn→∞ λ(n) = 0 and adopts λ := n−α with α ∈ (0, 1].
random features in over-parameterized settings for understand- The loss function `(y, f (x)) in Eq. (1) measures the quality
ing generalization properties of over-parameterized neural of the prediction f (x) at x ∈ X with respect to the observed
networks as well as the gaps in existing theoretical analysis. response y . Popular choices of ` include the squared loss
We view this topic as a promising research direction. `(y, f (x)) = (y − f (x))2 in kernel ridge regression (KRR)
The remainder of this paper is organized as follows. Section 2 and the hinge loss `(y, f (x)) = max(0, 1 − yf (x)) in support
presents the preliminaries and a taxonomy of random features vector machines (SVM), etc. For a given `, the empirical
based algorithms. We review data-independent algorithms in risk functional on the sample set is defined as Ez (f ) =
1 Pn
Section 3 and data-dependent approaches in Section 4. In Section 5, n i=1 `(yi , f (xi )),R and the corresponding expected risk is
we survey existing theoretical results on kernel approximation defined as E(f ) = X ×Y `(y, f (x))dρ. The statistical theory
and generalization performance. Experimental comparisons of of supervised learning in an approximation theory view aims to
representative random features based methods are given in understand the generalization property of fz,λ as an approximation
3

poly kernel [34], addtive kernel [35]

structural: Fastfood [36]
quasi-Monte Carlo: QMC [37]
doubly stochastic [38]
sperical poly kernel [39]
circulant: SCRF [40]
orthogonality: ORF, SORF [24],P -model [41]
denisty transformation: MM [42], SSF [43]
Algorithm
deterministic quadratures: GQ, SGQ [26]
stochastic quadratures: SSR [27]
RFF [9]: Test-of-Time award in
NeurIPS2017 kernel alignment: KP-RFF [44]
quantization [45], [46]
weighted RFF: [32]
leverage score: LS-RFF [31]
histogram
kernels: [33] surrogate leverage score: [47], [48]
kernel learning with RFF

2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 future

kk − k̃k∞ : kk − k̃k∞ : [30], [49]

[9]
kk − k̃kLr : [49]
Lipschitz conutinous:
ω ∼ p(·) [32] kk − k̃k∞ : [50]

∆-spectral approximation and empirical risk: [51]

excess risk: ω ∼ q(·) [52]

Theory √
KRR: Ω( n log n) [53]

SVM: ≤ Ω(n) [11]

(∆1 , ∆2 )-spectral approximation: [45]

KRR, SVM: ω ∼ p(·), ω ∼ q(·) [31] (best paper

finalist in ICML2019)

RFF in DNNs: double descent [20],

NTK [12], lottery ticket hypothesis
[54]

Figure 1. Timeline of representative work on the algorithms and theory of random features.

of the true target function fρ , which can be quantified by the excess as a finite-dimensional empirical risk minimization problem
risk E(fz,λ ) − E(fρ ), or the estimation error kfz,λ − fρ k2 in an n
1X
appropriate norm k · k. βλ := argmin ` yi , β> ϕ(xi ) + λkβk22 . (3)
Using an explicit randomized feature mapping ϕ : Rd → Rs , β∈Rs n i=1
one may approximate the kernel function k(x, x0 ) by k̃(x, x0 ) =
For example, in least squares regression where ` is the squared
hϕ(x), ϕ(x0 )i. In this case, the approximate kernel k̃(·, ·) defines 2
loss, the first term in problem (3) is equivalent to ky − Zβk2 ,
an RKHS H e (not necessarily contained in the RKHS H associated
>
where y = [y1 , y2 , · · · , yn ] is the label vector and Z =
with the original kernel function k ). With the above approximation,
[ϕ(x1 ), · · · , ϕ(xn )]> ∈ Rn×s is the random feature matrix.
one solves the following approximate version of problem (1):
( n ) This is a linear ridge regression problem in the space spanned
1X by the random features, with the optimal prediction given by
˜ 2

fz,λ := argmin ` yi , f (xi ) + λkf kHe . (2) f˜z,λ (x0 ) = β> 0 0
n i=1 λ ϕ(x ) for a new data point x , where βλ has the
f ∈H > −1 >
explicit expression βλ = (Z Z+nλI) Z y . For classification,
e

By the representer theorem [1], the above problem can be rewritten one may take the sign to output the binary classification labels. Note
4

that problem (3) also corresponds to fixed-size kernel methods with where Λi ≥ 0 are the Fourier coefficients,
Yi,j is
the spherical
feature map approximation (related to Nyström approximation) and i + d − 3
harmonics, and N (d, i) = 2i+d−2 .
estimation in the primal [2]. i d−2
Note that, dot product kernels defined in Rd do not belong
2.2 Theoretical Foundation of Random Features
to the rotation-invariant class. Nevertheless, by virtue of the
The theoretical foundation of RFF builds on Bochner’s celebrated neural network structure under Gaussian initialization, some dot
characterization of positive definite functions. product kernels defined on Rd are able to benefit from the sampling
Theorem 1 (Bochner’s Theorem [55]). A continuous and shift- framework q behind RFF. Given a two-layer network of the form
f (x; θ) = 2s sj=1 aj σ(ω>
P
invariant function k : Rd × Rd → R is positive definite if and only j x) with s neurons (notation chosen
if it can be represented as to be consistent with the number of random features), for some
Z activation function σ and x ∈ Rd , when ω ∼ N (0, Id ) are fixed
0
k(x − x ) = exp iω> (x − x0 ) µk (dω) , and only the second layer (parameters a) are optimized3 , this
Rd
actually corresponds to random features approximation
where µk is a positive finite measure on the frequencies ω .
According to Bochner’s theorem, the spectral distribution µk k (x, x0 ) = Eω∼N (0,Id ) [σ(ω > x)σ(ω> x0 )] , (6)
of a stationary kernel k is the finite measure induced by a Fourier
transform. By setting k(0) = 1, we may normalize µk to a where the nonlinear activation function σ(·) depends on the
probability density p (the Fourier transform associated with k ), kernel type such that ϕ(xi ) := σ(W xi ) in Eq. (5), by denoting
hence the transformation matrix W := [ω1 , ω2 , · · · , ωs ]> ∈ Rs×d .
Z The formulation in (6) is quite general to cover a series of
k(x − x0 ) = exp iω> (x − x0 ) µ(dω)

kernels by various activation functions. For example, if we
Rd (4)
take σ(x) = [cos(x), sin(x)]> , Eq. (6) corresponds to the
= Eω∼p(·) exp(iω> x) exp(iω> x0 )∗ ,

Gaussian kernel, which is the standard RFF model [9] for
where the symbol z ∗ denotes the complex conjugate of z . The Gaussian kernel approximation. If we consider the commonly
kernels used in practice are typically real-valued and thus the used ReLU activation σ(x) = max{0, x} in neural networks,
imaginary part in Eq. (4) can be discarded. According to Eq. (4), Eq. (6) corresponds to the first order arc-cosine √ kernel, termed as
0 1
RFF makes use of the standard Monte Carlo sampling scheme to k(x, x ) ≡ κ1 (u) = π (u(π −arccos(u))+ 1 − u2 ) by setting
approximate k(x, x0 ). In particular, one uses the approximation u := hx, x0 i/(kxkkx0 k). If the Heaviside step function σ(x) =
1
2 (1 + sign(x)) is used, Eq. (6) corresponds to the zeroth order
k(x, x0 ) = Eω∼p [ϕp (x)> ϕp (x0 )] ≈ k̃p (x, x0 ) := ϕp (x)> ϕp (x0 ) arc-cosine kernel, termed as k(x, x0 ) ≡ κ0 (u) = 1− 1 arccos(u)
π
with the explicit feature mapping2 by setting u := hx, x0 i/(kxkkx0 k), refer to arc-cosine kernels
[60] for details. If we take other activation functions used in neural
1 > > >
ϕp (x) := √ exp(−iω1 x), · · · , exp(−iωs x)] , (5) networks, e.g., erf activations [61], GELU [62] in Eq. (6), such
s two-layer neural network also corresponds to a kernel. In this case,
where {ωi }i=1 are sampled from p(·) independently of the the standard RFF model is still valid (via Monte Carlo sampling
s

training set. Consequently, the original kernel matrix K = from a Gaussian distribution) for these non-stationary kernels.
[k(xi , xj )]n×n can be approximated by K ≈ K fp = Zp Z> Further, for a fully-connected deep neural network (more than
p
>
with Zp = [ϕp (x1 ), · · · , ϕp (xn )] ∈ R n×s
. It is convenient two layers) and fixed random weights before the output layer,
to introduce the shorthand >
zp (ωi , xj ) := exp(−iωi xj ) such if the hidden layers are wide enough, one can still approach a
√ > kernel obtained by letting the widths tend to infinity [63], [64].
that ϕp (x) = 1/ s[zp (ω1 , x), · · · , zp (ωs , x)] . With this
0
notation, the approximate kernel k̃p (x, x ) can be rewritten as If both intermediate layers and the output layer are trained by
0 1 Ps
k̃p (x, x ) = s i=1 zp (ωi , x)zp (ωi , x ).0 (stochastic) gradient descent, for the network f (x; θ) with large
A similar characterization in Eq. (4) is available for rotation- enough s, the model remains close to its linearization around its
invariant kernels, where the Fourier basis functions are spherical random initialization throughout training, known as lazy training
harmonics [56], [57]. Here rotation-invariant kernels are dot- regime [65]. Learning is then equivalent to a kernel method with
product kernels defined on the unit sphere X = S d−1
:= {x ∈ another architecture-specific kernel, known as neural tangent kernel
d
R : kxk2 = 1}, and can be represented as a non-negative (NTK, [12]). Interestingly, NTK for two-layer ReLU networks
expansion with spherical harmonics, refer to the book [58] for [66] can be constructed by arc-cosine kernels, i.e., k (x, x0 ) =
0
details. kxkkx k[uκ0 (u) + κ1 (u)]. In fact, there is an interesting line of
work showing insightful connections between kernel methods and
Theorem 2 ([56]). A rotation-invariant continuous function k : (over-parameterized) neural networks, but this is out of scope of
Sd−1 × Sd−1 → R is positive definite if and only if it has a this survey on random features. We suggest the readers refer to
d
symmetric non-negative expansion into spherical harmonics Y`,m , some recent literature [13], [67], [68] for details.
that is Further, if we consider the general non-stationary kernels [69],
X∞ N (d,i)
X [70], the spectral representation can be generalized by introducing
k(x, x0 ) ≡ k(hx, x0 i) = Λi Yi,j (x)Yi,j (x0 ) , two random variables ω and ω 0 .
i=0 j=1

2. The subscript in ϕp , Zp , kp (and other symbols) emphasizes the 3. Extreme learning machine [59] is another structure in a two-layer
dependence on the distribution p(·) but can be omitted for notational simplicity. feedforward neural network by randomly hidden nodes.
5

Theorem 3. ( [70], [71], [72]) A non-stationary kernel k is 2.4 Taxonomy of random features based algorithms
positive definite if and only if it admits The key step in random features based algorithms is constructing
Z
>
the following random feature mapping
k(x, x0 ) = exp i ω> x − ω 0 x0 µΨk (dω, dω 0 ) ,
Rd ×Rd 1
ϕ(x) := √ a1 exp(−iω> >
1 x), · · · , as exp(−iωs x)]
>
(7)
s
where µΨk is the Lebesgue-Stieltjes measure on the product space
Rd × Rd associated to some positive definite function Ψk (ω, ω 0 ) so as to approximate the integral (4). Random features {ωi }si=1 can
with bounded variations. be formulated as the feature matrix W = [ω1 , · · · , ωs ]> ∈ Rs×d
in a compact form. Existing algorithms differ in how they select
the points ωi (the transformation matrix W ) and weights ai .
2.3 Commonly used kernels in Random Features Figure 2 presents a taxonomy of some representative random
Random features based algorithms often consider the following features based algorithms. They can be grouped into two categories,
kernels: data-independent algorithms and data-dependent algorithms, based
i) Gaussian kernel: Arguably the most important member of on whether or not the selection of ωi and ai is independent of the
shift-invariant kernels, the Gaussian kernel is given by training data.
Data-independent random features based algorithms can be
kx − x0 k22

further categorized into three classes according to their sampling
k(x, x0 ) = exp − ,
2ς 2 strategy:
i) Monte Carlo sampling: The points {ωi }si=1 are sampled
where ς > 0 is the kernel width. The density (see Theorem 1 from p(·) in Eq. (4) (see the red box in Figure 2). In particular,
or Eq. (6)) associated with the Gaussian kernel is Gaussian ω ∼ to approximate the Gaussian kernel by RFF [9], these points are
N (0, ς −2 Id ). sampled from the Gaussian distribution p = N (0, ς −2 Id ), with
ii) arc-cosine kernels: This class admits Eq. (6) by sampling the weights being equal, i.e., ai ≡ 1 in Eq. (7). To reduce the
from the Gaussian distribution N (0, Id ), that can be connected storage and time complexity, one may replace the dense Gaussian
to a two-layer neural networks with various activation functions. matrix in RFF by structural matrices; see, e.g., Fastfood [36]
Following [60], we define the b-order arc-cosine kernel by using Hadamard matrices as well as its general version P -model
1 [41]. An alternative approach is using circulant matrices; see,
k(x, x0 ) = kxkb2 kx0 kb2 Jb (θ) , e.g., Signed Circulant Random Features (SCRF) [40]. To improve
π
the approximation quality, a simple and effective approach is to
> 0
where θ = cos−1 kxkx2 kx
x
0k and use an `2 -normalization scheme, which leads to Normalized RFF
2
(NRFF) [79]. Another powerful technique for variance reduction
b is orthogonalization to decrease the randomness in Monte Carlo
π−θ

1 ∂
Jb (θ) = (−1)b (sin θ)2b+1 . sampling. Typical algorithms include Orthogonal Random Features
sin θ ∂θ sin θ (ORF) [24] by employing an orthogonality constraint to the random
Most common in practice are the zeroth order (b = 0) and first Gaussian matrix, Structural ORF (SORF) [24], [91], and Random
order (b = 1) arc-cosine kernels. The zeroth order kernel is given Orthogonal Embeddings (ROM) [80].
explicitly by ii) Quasi-Monte Carlo sampling: This is a typical sampling
θ scheme in sampling theory [92] to reduce the randomness in
k(x, x0 ) = 1 − , Monte Carlo sampling for variance reduction. It can significantly
π
improve the convergence of Monte Carlo sampling by virtue of
and the first order kernel is a low-discrepancy sequence t1 , t2 , · · · , ts ∈ [0, 1]d instead of
1 a uniform sampling sequence over the unit cube to construct
k(x, x0 ) = kxk2 kx0 k2 (sin θ + (π − θ) cos θ) . the sample points; see the integral representation in the green
π
box in Figure 2. Based on this representation, it can be used
iii) Polynomial kernel: This is a widely used family of non- for kernel approximation, as conducted by [25]. Subsequently,
stationary kernels given by Lyu [43] proposes Spherical Structural Features (SSF), which
generates asymptotically uniformly distributed points on Sd−1
k(x, x0 ) = (1 + hx, x0 i)b , to achieve better convergence rate and approximation quality.
The Moment Matching (MM) scheme [42] is based on the same
where b is the order of the polynomial.
integral representation but uses a d-dimensional refined uniform
Note that, dot-product kernel defined in Rd admit neither
sampling sequence {ti }si=1 instead of a low discrepancy sequence.
spherical harmonics nor Eq. (6). As a result, random features
Strictly speaking, SSF and MM go beyond the QMC framework.
for polynomial kernels work in different theoretical foundations
Nevertheless, these methods share the same integration formulation
and settings, and have been studied in a smaller number of
with QMC over the unit cube and thus we include them here for a
papers, including Maclaurin expansion [34], the tensor sketch
streamlined presentation.
technique [73], [74], and oblivious subspace embedding [75], [76].
iii) Quadrature based methods: Numerical integration tech-
Interestingly, if the data are `2 normalized, dot product kernels
niques can be also used to approximate the integral representation
defined in Rd can be transformed as stationary but indefinite (real,
in Eq. (4). These techniques may involve deterministic selection of
symmetric, but not positive definite) on the unit sphere4 . The related
random features based algorithms under this setting provide biased 4. This setting cannot ensure the data are i.i.d on the unit sphere, which is
estimators [39], [77], or unbiased estimation [78]. different from the setting of previously discussed rotation invariant kernels.
6
 
  (
 
structural: Fastfood [36], P -model [41], SORF [24]

 
 

 acceleration circulant: SCRF [40]

 
 


 
 
 
 i) Monte Carlo sampling

 
 (


`2 normalization: NRFF [79]
 
 
variance reduction

 
 


orthogonal constraint: ORF [24], ROM [80]
 
 


 
 

data-independent

 






  QMC [37]

ii) Quasi-Monte Carlo sampling







 structural spherical feature: SSF [43]
 
  moment matching: MM [42]


 


 
 (

deterministic quadrature rules: GQ, SGQ [26]
 


iii) Quadrature rules





 stochastic spherical-radial rule: SSR [27]

 






 leverage score sampling: LSS-RFF
 [31], fast leverage score approximation [47], [48], [81]
 








  weighted random features: [32], [82] for RFF, [25] for QMC, [26] for GQ







 re-weighted random features kernel alignment: KA-RFF [83] and KP-RFF [44]
 
  compressed low-rank approximation: CLR-RFF [46]


 
data-dependent

 

 one-stage: ( [84] via generative models

  

 
 
 

 
 kernel learning by random features joint optimization: [85], [86]

 
 two-stage
spectral learning in mixture models: [87], [88], [89], [90]

 
 


 
 
 
others: quantization [45]; doubly stochastic [38]

 

ii) Quasi-Monte Carlo sampling
i) Monte Carlo sampling • QMC
k(x − x0 ) = p(ω) exp iω> (x − x0 ) dω • SSF
R
• variance reduction
Rd
• MM
• acceleration
k(x − x0 ) = [0,1]d exp i(x−x0 )>Φ−1 (t) dt
R

Qd R ∞ (j)
k(x−x0 ) = (j) (j) (j) 0(j)

j=1 −∞ pj ω exp iω (x − x ) dω
R∞ r2
k(x − x0 ) = (2π)−d/2 e− 2 |r|d−1 g(ru)drdu
R
data-dependent Ud 0
iii) Quadrature rules
• random features
• GQ, SGQ
k(x, x0 ) = Rd q(ω) p(ω) > 0
R
selection/learning q(ω) exp iω (x − x ) dw • SSR
• leverage score

Figure 2. A taxonomy of representative random features based algorithms.

the points and weights, e.g., by using Gaussian Quadrature (GQ) RFF [32], [82], weighted QMC [25], and weighted GQ [26]. Note
[26] or Sparse Grids Quadrature (SGQ) [26] over each dimension that these algorithms directly learn the weights of pre-given random
(their integration formulation can be found in the first blue box in features. Another line of methods re-weight the random features
Figure 2). The selection can also be randomized. For example, in the using a two-step procedure: i) “up-projection”: first generate a
work [27], the d-dimensional integration in Eq. (4) is transformed large set of random features {ωi }Ji=1 ; ii) “compression”: then
to a double integral, and then approximated by using the Stochastic reduce these features to a small number (e.g., 102 ∼ 103 ) in a
Spherical-Radial (SSR) rule (see the second blue box in Figure 2). data-dependent manner, e.g., by using kernel alignment [83], kernel
Data-dependent algorithms use the training data to guide the polarization [44], or compressed low-rank approximation [46].
selection of points and weights in the random features for better
iii) Kernel learning by random features: This class of methods
approximation quality and/or generalization performance. These
aim to learn the spectral distribution of kernel from the data so as
algorithms can be grouped into three classes according to how the
to achieve better similarity representation and prediction. Note that
random features are generated.
these methods learn both the weights and the distribution of the
i) Leverage score sampling: Built upon the importance sampling features, and hence differ from the other random features selection
framework, this class of algorithm replaces the original distribution methods mentioned above, which assume that the candidate features
p(ω) by a carefully chosen distribution q(ω) constructed using are generated from a pre-given distribution and only learn the
leverage scores [51], [52] (see the yellow box in Figure 2). The weights of these features. Representative approaches for kernel
representative approach in this class is Leverage Score based RFF learning involve a one-stage [84] or two-stage procedure [85], [86],
(LS-RFF) [31], and its accelerated version [47], [81]. [87], [88], [89], [90]. From a more general point of view, the
ii) Re-weighted random feature selection: Here the basic idea aforementioned re-weighted random features selection methods
is to re-weight the random features by solving a constrained can also be classified into this class. Since these methods belong to
optimization problem. Examples of this approach include weighted the broad area of kernel learning instead of kernel approximation,
7

we do not detail them in this survey. RFF, especially in high dimensions (e.g., d ≥ 1000). In particular,
Besides the above three main categories, other data-dependent W used in Eq. (8) is substituted by
approaches include the following. i) Quantization random features
1
[45]: Given a memory budget, this method quantizes RFF for WFastfood = B1 HGΓHB2 , (9)
Gaussian kernel approximation. A key observation from this work ς
is that random features achieve better generalization performance where H is the Walsh-Hadamard matrix admitting fast multiplica-
than Nyström approximation [93] under the same memory space. tion in O(d log d) time, and Γ ∈ {0, 1}d×d is a permutation
ii) Doubly stochastic random features [38]: This method uses matrix that decorrelates the eigen-systems of two Hadamard
two sources of stochasticity, one from sampling data points by matrices. The three diagonal random matrices G, B1 and B2
stochastic gradient descent (SGD), and the other from using RFF are specified as follows: G has independent Gaussian entries
to approximate the kernel. This scheme has been used for Kernel drawn from N (0, 1); B1 is a random scaling matrix with
PCA approximation [94], and can be further extended to triply (B1 )ii = kωi k2 /kGkF , which encodes the spectral properties
stochastic scheme for multiple kernel approximation [95]. of the associated kernel; B2 is a binary decorrelation matrix
with independent random {±1} entries. FastFood is an unbiased
estimator, but may have a larger variance than RFF:
3 DATA - INDEPENDENT A LGORITHMS
6τ 4 τ2

−τ 2
In this section, we discuss data-independent algorithms in a V [Fastfood ] − V [RFF ] ≤ e + ,
s 3
unified framework based on the transformation matrix W , that
plays an important role in constructing the mapping ϕ(·) in which converges at an O(1/s) rate.
Eq. (7) and determining how well the estimated kernel converges P -model [41]: A general version of Fastfood, the P -model
to the actual kernel. Table 2 reports various random features constructs the transformation matrix as
based algorithms in terms of the class of kernels they apply
to (in theory) as well as their space and time complexities for WP = [g> P1 , g> P2 , · · · , g> Ps ]> ∈ Rs×d ,
computing the feature mapping W x for a given x ∈ X . In
where g is a Gaussian random vector of length a and P = {Pi }si=1
Table 2, we also summarize the variance reduction properties of
is a sequence of a-by-d matrices each with unit `2 norm columns.
these algorithms, i.e., whether the variance of the resulting kernel
Fastfood can viewed as a special case of the P -model: the matrix
estimator is smaller than the standard RFF. Before proceeding,
HG in Eq. (9) can be constructed by using a fixed budget of
we introduce some notations and definitions. When discussing
randomness in g and letting each Pi be a random diagonal matrix
a stationary kernel function k(x, x0 ) = k(x − x0 ), we use
with diagonal entries of the form Hi1 , Hi2 , . . . , Hid . The P -model
the convenient shorthands τ := x − x0 and τ := kτ k2 .
is unbiased and its variance is close to that of RFF with an O(1/d)
For a random features algorithm A with frequencies {ωi }si=1
convergence rate
sampled from a distribution µ(·)P,s we define>
its
expectation
E(A) := E[k(τ )] = Eω∼µ Ps 1/s i=1 cos(ω
i τ ) and variance V[P -model] − V[RFF] = O (1/d) .
V[A] := V[k(τ )] = V 1s i=1 cos(ω> τ ) .
SCRF [40]: It accelerates the construction of random features
3.1 Monte Carlo sampling based approaches by using circulant matrices. The transformation matrix is

We describe several representative data-independent algorithms WSCRF = [ν ⊗ C(ω1 ), ν ⊗ C(ω2 ), · · · , ν ⊗ C(ωt )]> ∈ Rtd×d ,
based on Monte Carlo sampling, using the Gaussian kernel
k(x, x0 ) = k(τ ) = exp(−kτ k22 /2ς 2 ) as an example. Note that where ⊗ denotes the tensor product, ν = [ν1 , ν2 , . . . , νd ] is a
these algorithms often apply to more general classes of kernels, as Rademacher vector with P(νi = ±1) = 1/2, and C(wi ) ∈ Rd×d
summarized in Table 2. is a circulant matrix generated by the vector ωi ∼ N (0, ς −2 Id ).
RFF [9]: For Gaussian kernels, RFF directly samples the Thanks to the circulant structure, we only need O(s) space to store
random features from a Gaussian distribution (corresponds to the feature mapping matrix WSCRF with s = td. Note that C(wi )
the inverse Fourier transform): {ω}si=1 ∼ p(ω). In particular, the can be diagonalized using the Discrete Fourier Transform for ωi .
corresponding transformation matrix is given by SCRF is unbiased and has the same variance as RFF.
The above three approaches are designed to accelerate the
1 computation of RFF. We next overview representative methods that
WRFF = G, (8)
ς aim for better approximation performance than RFF.
NRFF [79]: It normalizes the input data to have unit `2 norm
where G ∈ Rs×d is a (dense) Gaussian matrix with Gij ∼
before constructing the random Fourier features. With normalized
N (0, 1). For other stationary kernels, the associated p(·) corre-
data, the Gaussian kernel can be computed as
sponds to the specific distribution given by the Bochner’s Theorem.
For example, the Laplacian kernel k(τ ) = exp(−kτ k1 /ς) is 1

x> x0
!
0
associated with a Cauchy distribution. RFF is unbiased, i.e., k(x, x ) = exp − 2 1 − ,
ς kxk2 kx0 k2
E[RFF] = exp(−kτ k22 /2ς 2 ), and the corresponding variance is
2
V[RFF] = (1 − e−τ )2 /2s. which is related to the normalized linear kernel [39], [79]. Albeit
Fastfood [36]: By observing the similarity between the dense simple, NRFF is effective in variance reduction and in particular
Gaussian matrix and Hadamard matrices with diagonal Gaussian satisfies
matrices, Le et al. [36] firstly introduce Hadamard and diagonal 1 −τ 2 2
matrices to speed up the construction of dense Gaussian matrices in V[NRFF] = V[RFF] − e (3 − e−2τ ) .
4s
8

Table 2
Comparison of different kernel approximation methods on space and time complexities to obtain W x.

Method Kernels (in theory) Extra Memory Time Lower variance than RFF
Random Fourier Features (RFF) [9] shift-invariant kernels O(sd) O(sd) -
Quasi-Monte Carlo (QMC) [37] shift-invariant kernels O(sd) O(sd) Yes
Normalized RFF (NRFF) [79] Gaussian kernel O(sd) O(sd) Yes
Moment matching (MM) [42] shift-invariant kernels O(sd) O(sd) Yes
Orthogonal Random Feature (ORF) [24] Gaussian kernel O(sd) O(sd) Yes
Fastfood [36] Gaussian kernel O(s) O(s log d) No
Spherical Structured Features (SSF) [43] shift and rotation-invariant kernels O(s) O(s log d) Yes
Structured ORF (SORF) [24], [91] shift and rotation-invariant kernels O(s) O(s log d) Unknown
Signed Circulant (SCRF) [40] shift-invariant kernels O(s) O(s log d) The same
P-model [41] shift and rotation-invariant kernels O(s) O(s log d) No
Random Orthogonal Embeddings (ROM) [80] rotation-invariant kernels O(d) O(d log d) Yes
Gaussian Quadrature (GQ), Sparse Grids Quadrature (SGQ) [26] shift invariant kernels O(d) O(d log d) Yes
Stochastic Spherical-Radial rules (SSR) [27] shift and rotation-invariant kernels O(d) O(d log d) Yes

ORF [24]: It imposes orthogonality on random features for the SORF [24], [91]: It replaces the random orthogonal matrices
Gaussian kernel and has the transformation matrix used in ORF by a class of structured matrices akin to those in
1 Fastfood. The transformation matrix of SORF is given by
WORF = SQ , √
ς d
WSORF = HD1 HD2 HD3 , (12)
where Q is a uniformly distributed random orthogonal matrix, and ς
S is a diagonal matrix with diagonal entries sampled i.i.d from where H is the normalized Walsh-Hadamard matrix and Di ∈
the χ-distribution with d degrees of freedom. This orthogonality Rd×d , i = 1, 2, 3 are diagonal “sign-flipping” matrices, of
constraint is useful in reducing the approximation error in random which each diagonal entry is sampled from the Rademacher
features. It is also considered in [96] for unifying orthogonal Monte distribution. Bojarski et al. [91] consider more general structures
Carlo methods. ORF is unbiased and with variance bounded by for the three blocks of matrices HDi in Eq. (12). Note that
2
first block HD1 satisfies
!
1 g(τ ) (d − 1)e−τ τ 4 each
h block plays a different
i role. The 2
V[ORF] − V[RFF] ≤ − , Pr kHD1 xk∞ > log √d ≤ 2de− log8 d
for any x ∈ Rd with
s d 2d d
log2 d
2 kxk2 = 1, termed as (log d, 2de− 8 )-balanced, hence no
where we have g(τ ) = eτ τ 8 + 6τ 6 + 7τ 4 + τ 2 /4

2 dimension carries too much of the `2 norm of the vector x. The
+eτ τ 4 τ 6 + 2τ 4 /2d. It can be seen that the variance

second block HD2 ensures that vectors are close to orthogonal.
reduction property Var[ORF] < Var[RFF] holds under some The third block HD3 controls the capacity of the entire structured
conditions, e.g., when d is large and τ is small. For a large d, the transform by providing a vector of parameters. SORF is not an
ratio of the variances of ORF and RFF can be approximated by unbiased estimator of the Gaussian kernel, but it satisfies an
2
V[ORF] (s − 1)e−τ τ 4 asymptotic unbiased property
≈1− 2 . (10)
6τ
V[RFF] d 1 − e−τ 2 2
E [SORF] − e−τ /2 ≤ √ .
d
Choromanski et al. [97] further improve the variance bound to
ROM [80]: It generalizes SORF to the form
V[RFF]−V[ORF] = √ t
p dY
J d −1 ( R12 + R22 τ )Γ(d/2)
" #
s−1 WROM = HDi ,
ER1 ,R2 2
p d
ς i=1
s ( R12 + R22 τ /2) 2 −1 (11)
" #2 where H can be the normalized Hadamard matrix or the Walsh
s−1 J d −1 (R1 τ ) Γ(d/2) matrix, and Di is the Rademacher matrix as defined in SORF.
− ER1 2
d
−1
,
s (R1 τ /2) 2 Theoretical results in [80] show that the ROM estimator achieves
variance reduction compared to RFF. Interestingly, odd values of t
where Jd is the Bessel function of the first kind of degree d, and yield better results than even t. This provides an explanation for
R1 and R2 are two independent scalar random variables satisfying why SORF chooses t = 3.
ω1 = R1 v and ω2 = R2 v with ω1 , ω2 ∼ N (0, ς −2 Id ) and LP-RFF [45]: It attempts to quantize RFF with the Gaussian kernel
v ∼ Unif(S d−1 ). According to Eq. (11), the property V[ORF] < under a memory budget, i.e., mapping each s-dimensional
V[RFF] holds asymptotically in cases: i) a fixed d and a small p p p random
1 feature zp (x) = 2/s cos(WRFF x) ∈ [− 2/s, 2/s] to an
enough τ with E[kωk42 ] ≤ ∞; ii) a fixed τ < 4√ with some s-dimensional low precision vector with b bits
constant c and a large d, in which case we have
c
p via apstochastic
rounding scheme. They divide the interval [− 2/s, 2/s] into
s − 1 1 τ 4 − τ22 2pb − 1 equal-sized sub-intervals and randomly round each value
1
V[RFF] − V[ORF] = e ς +O . 2/s cos(ωi x) to either the top or bottom of the corresponding
s 2d ς 2 d
9

sub-interval. Strictly speaking, this method does not belong to data- SSF [43]: It improves the space and time complexities of
independent algorithms. But we put it here for ease of description QMC for approximating shift- and rotation-invariant kernels.
as this approach directly quantizes RFF. More importantly, a SSF generates points {v1 , v2 , · · · , vs } asymptotically uniformly
new insight demonstrated by this method is that, under the same distributed on the sphere Sd−1 , and construct the transformation
memory budget, random features based algorithms achieve better matrix as
generalization performance than Nyström approximation [93].
Apart from the stochastic quantization scheme used in [45], the WSSF = [Φ−1 (t)v1 , Φ−1 (t)v2 , · · · , Φ−1 (t)vs ]> ∈ Rs×d ,
authors of [98] employ Lloyd-Max quantization with a smaller where Φ−1 (t) uses the one-dimensional QMC point. The structure
number of bits. matrix V := [v1 , v2 , · · · , vs ] ∈ S(d−1)×s has the form
From the above description, one can find that orthogonalization
is a typical operation for variance reduction, e.g., ORF/SORF/ROM. 1 Re FΛ − Im FΛ
V = ∈ Rd×s ,
d/2 Im FΛ Re FΛ
p
Here we take the Gaussian kernel as an example to illustrate
insights of such scheme. By sampling {ωi }si=1 ∼ N (0, ς −2 Id ), d s
the used Gaussian distribution is isotropic and only depends on the where FΛ ∈ C 2 × 2 consists of a subset of the rows of the discrete
s s
norm kωk2 instead of ω . The used orthogonal operator makes the Fourier matrix F ∈ C 2 × 2 . The selection of d2 rows from F is
direction of ωi orthogonal to each other (that means more uniform) done by minimizing the discrete Riesz 0-energy [99] such that the
while retaining its norm unchanged5 , which leads to decrease the points spread as evenly as possible on the sphere.
randomness in Monte Carlo sampling, and thus achieve variance MM [42]: It also uses the transformation matrix in Eq. (14),
reduction effect. If we attempt to directly decrease the randomness but generates a d-dimensional uniform sampling sequence {ti }si=1
in Monte Carlo sampling, QMC is a powerful way to achieve this by a moment matching scheme instead of using a low discrepancy
goal and can then be used to kernel approximation. This is another sequence as in QMC. In particular, the transformation matrix is
line of random features with variance reduction illustrated as below. e −1 (t1 ), Φ
e −1 (t2 ), · · · , Φ
e −1 (ts )]> ∈ Rs×d , (15)
WMM = [Φ

3.2 Quasi-Monte Carlo Sampling where one uses moment matching to construct the vectors
e −1 (ti ) = Ã−1 (Φ−1 (ti ) − µ̃) with the sample mean µ̃ =
Φ
Here we briefly review methods based on quasi-Monte Carlo 1 Ps −1
sampling (QMC) [37], spherical structured feature (SSF) [43], s i=1 Φ (ti ) and the square root of the sample covariance
and moment matching (MM) [42]. These three methods achieve a matrix Ã satisfying ÃÃ> = Cov(Φ−1 (ti ) − µ̃).
lower variance or approximation error than RFF. Strictly speaking, To achieve the target of variance reduction, both orthogonaliza-
the later two algorithms do not belong to the quasi-Monte Carlo tion in Monte Carlo sampling and QMC based algorithms share
sampling framework. However, SSF and MM share the same the similar principle, namely, generating random features that
integration formulation with QMC and thus we introduce them are as independent/uniform as possible. To be specific, QMC
here for simplicity. and MM are able to generate more uniform data points to avoid
Classical Monte Carlo sampling generates a sequence of undesirable clustering effect, see Figure 1 in [37]. Likewise, SSF
samples randomly and independently, which may lead to an aims to generate asymptotically uniformly distributed points on
undesired clustering effect and empty spaces between the samples the sphere Sd−1 , which attempts to encode more information
[92]. Instead of fully random samples, QMC [37] outputs low- with fewer random features, and thus allows for variance
discrepancy sequences. A typical QMC sequence has a hierarchical reduction. In sampling theory, QMC can be further improved
structure: the initial points are sampled on a coarse scale whereas by an sub-grouped based rank-one lattice construction [100] for
the subsequent points are sampled more finely. For approximating computational efficiency, which can be used for the subsequent
a high-dimensional integral, QMC achieves an asymptotic error kernel approximation.
convergence rate of = O((log s)d /s), which is faster than the
O(s−1/2 ) rate of Monte Carlo. Note however that QMC often 3.3 Quadrature based Methods
requires s to be exponential in d for the improvement to manifest. Quadrature based methods build on a long line of work on
QMC [37]: It assumes that p(·) factorizes with respect to the numerical quadrature for estimating integrals. In quadrature
Qd
dimensions, i.e., p(x) = j=1 pj (xj ), where each pj (·) is a methods, the weights are often non-uniform, and the points are
univariate density function. QMC generally transforms an integral usually selected using deterministic rules including Gaussian
on Rd in Eq. (4) to one on the unit cube [0, 1]d as quadrature (GQ) [26], [101] and sparse grids quadrature (SGQ) [26].
Deterministic rules can be extended to their stochastic versions. For
Z
k(x − x0 ) = exp i(x − x0 )> Φ−1 (t) dt ,

(13) example, Munkhoeva et al. [27] explore the stochastic spherical-
[0,1]d
radial (SSR) rule [102], [103] in kernel approximation. Below we
where Φ−1 (t) = Φ−1 −1 d

1 (t1 ) , · · · , Φd (td ) ∈ R with Φj being briefly review these methods.
the cumulative distribution function (CDF) of pj . Accordingly, by GQ [26]: It assumes that the kernel function k factorizes
generating a low discrepancy sequence t1 , t2 , · · · , ts ∈ [0, 1]d , with respect to the dimensions and the corresponding distribution
the random frequencies can be constructed by ωi = Φ−1 (ti ). The p(ω) = p([ω (1) , ω (2) , . . . , ω (d) ]> ) in Eq. (4) is sub-Gaussian.
corresponding transformation matrix for QMC is Therefore, the d-dimenionsal integral in Eq. (4) can be factorized
as
WQMC = [Φ−1 (t1 ), Φ−1 (t2 ), · · · , Φ−1 (ts )]> ∈ Rs×d . (14)
Yd Z ∞
0 (j) (j) (j) 0(j)
(j)
5. In fact, while orthogonalization only makes the direction of {ωi }si=1 k(x−x ) = pj ω exp iω (x − x ) dω .
more uniform, one can make the length kωi k2 uniform by sampling from the j=1 −∞
cumulative distribution function of kωk2 . (16)
10

Since each of the factors is a one-dimensional integral, we can the above two rules, we have the SSR rule. Accordingly, the
approximate them using a one-dimensional quadrature rule. For transformation matrix of SSR is
example, one may use Gaussian quadrature [101] with orthogonal
(QV )>

polynomials: WSSR = ϑ ⊗ ∈ R2(d+1)×d ,
−(QV )>
∞ L
with ϑ = [ϑ1 , ϑ2 , · · · , ϑs ] and V = [v1 , v2 , · · · , vd+1 ], where
Z X
p(ω) exp(iω(x − x0 ))dω ≈ aj exp iγ> 0

j (x − x ) , ϑ ∼ χ(d + 2) and {vi }d+1
−∞ j=1 i=1 are the vertices of a unit regular
(17) d-simplex, which is randomly rotated by Q. To get s features, one
where L is the accuracy level and each γj is a univariate point may stack s/(2d + 3) independent copies of W as suggested by
associated with the weight aj . For a third-point rule with the [27]. Finally, the feature mapping by SSR is given by
points {−p̂1 , 0, p̂1 } and their associated weights (â1 , â0 , â1 ), the ϕ(x) = [a0 g(0), a1 g(w> >
1 x), · · · , as g(ws x)] ,
transformation matrix WGQ ∈ Rs×d has entries Wij following the r q
distribution Pd+1
where a0 = 1 − j=1 ρd2 , aj = ρ1j 2(d+1) d
for j ∈ [s], and
j

Pr (Wij = ±p̂1 ) = â1 , Pr (Wij = 0) = â0 , ∀i ∈ [s], j ∈ [d] . wj is the j -th element of the stacked W .
In general, according to Eq. (6), kernel approximation
In general, the univariate Gaussian quadrature with L quadrature by random features is actually a d-dimensional integration
points is exact for polynomials up to (2L − 1) degrees. The approximation problem in mathematics. Sampling methods and
multivariate Gaussian quadrature is exact for all polynomials of quadrature based rules are two typical classes of approaches for
the form ω1i1 ω2i2 · · · ωdid with 1 ≤ ij ≤ 2L − 1; however the total high-dimensional integration approximation. Efforts on quadrature
number of points s = Ld scales exponentially with the dimension based methods focus on developing a high-accuracy, mesh-free,
d and thus this method suffers from the curse of dimensionality. efficiency rule, e.g., [105], [106]. Note that, if the integrand
SGQ [26]: To alleviate the curse of dimensionality, SGQ g(ω) := σ(ω > x)σ(ω> x0 ) in the integration representation (6)
uses the Smolyak rule [104] to decrease the needed number belongs to a RKHS, the above quadrature rules can be termed as
of points. Here we consider the third-degree SGQ using the kernel-based quadrature, e.g., Bayesian quadrature [107], [108]
symmetric univariate quadrature points {−p̂1 , 0, p̂1 } with weights and leverage-score quadrature [52]. This approach is in essence
(â1 , â0 , â1 ): different from the previously studied quadrature rules in functional
d
spaces, model formulation, and scope of application.
X
k(x, x0 ) ≈ (1−d+dâ0 ) g(0) + â1

g (p̂1 ej )+g (−p̂1 ej ) , 4 DATA - DEPENDENT ALGORITHMS
j=1
Data-dependent approaches aim to design/learn the random features
where the function g(ω) := σ(ω > x)σ(ω> x0 ) is given by Eq. (6), using the training data so as to achieve better approximation quality
and ei is the d-dimensional standard basis vector with the i-th or generalization performance. Based on how the random features
element being 1. The corresponding transformation matrix is are generated, we can group these algorithms into three classes:
leverage score sampling, random features selection, and kernel
WSGQ = [0d , p̂1 e1 , · · · , p̂1 ed , −p̂1 e1 , · · · , −p̂1 ed ]> ∈ R(2d+1)×d, learning by random features.

which leads to the explicit feature mapping

4.1 Leverage score based sampling
ϕ(x) = [â0 g(0), â1 g(w> >
2 x), · · · , g(w2d+1 x)] , Leverage score based approaches [31], [47], [109] are built on the
importance sampling framework. Here one samples {wi }si=1 from
where wi is the i-th row of WSGQ . Note that SGQ generates
a distribution q(w) that needs to be designed, and then uses the
2d + 1 points. To obtain a dimension-adaptive feature mapping,
following feature mapping in Eq. (5):
Dao et al. [26] propose to subsample the points according to the
distribution determined by their weights such that the mapping
s s !>
1 p (w1 ) −iw>1 x p (ws ) −iw>s x
feature dimension is equal to s. ϕq (x) = √ e ,··· , e .
SSR [27]: It transforms Eq. (6) (actually a d-dimensional
s q (w1 ) q (ws )
(19)
integral) to a double integral over a hyper-sphere and the real 0
Consequently, we have the approximation k(x, x ) =
line. Let ω = ru with u> u = 1 for r ∈ [0, ∞), we have Ps
Ew∼q [ϕq (x)> ϕq (x0 )]p ≈ i=1 zq (w i , x)z q (w i , x 0
) , where
Z ∞
Cd zq (wi , xj ) := p(wi )/q(wi )zp (wi , xj ). Thus, the
Z
r2
0
k(x − x ) = e− 2 |r|d−1 g(ru)drdu , (18) kernel matrix K can be approximated by Kq = Zq Z>
2 Sd−1 −∞ q,
where Zq := [ϕq (x1 ), · · · , ϕq (xn )]> ∈ Rn×s .
where the integrand is g(ω) := σ(ω > x)σ(ω> x0 ) given in Denoting by zq,wi (X) the i-th column of Zq , we have
Eq. (6) and Cd := (2π)−d/2 . The inner integral in Eq. (18) K = Ew∼p [zp,w (X)z> p,w (X)] = Ew∼q [zq,w (X)zq,w (X)].
>
can be approximated by stochastic radial rules of degree 2l + 1, To design the distribution q , one makes use of the ridge leverage
Pl g(ρi )+g(−ρi )
i.e., R(g) = i=0 ŵi 2 . The outer integral over function [51], [52] in KRR:
the d-sphere in Eq. (18) can be approximated by stochastic
Pq lλ (ωi ) = p(ωi )z>
p,ωi (X)(K + nλI)
−1
zp,ωi (X) , (20)
spherical rules: SQ (g) = j=1 w
e j g (Qu j ), where Q is a
random orthogonal matrix and w ej are stochastic weights whose where λ is the KRR regularization parameter. Define
distributions are such that the rule is exact for polynomials of degree Z
dλK := lλ (ω)dω = tr K(K + nλI)−1 .

q and gives unbiased estimate for other functions. Combining (21)
Rd
11

The quantity dλK n determines the number of independent random features, so that the kernel matrix matches the target kernel
parameters in a learning problem and hence is referred to as the yy> . Problem (24) can be efficiently solved via bisection over a
number of effective degrees of freedom [110], [111]. With the above scalar dual variable, and an -suboptimal solution can be found in
notation, the distribution q designed in [51] is given by O(J log(1/)) time.
KP-RFF (Kernel Polarization-RFF) [44]: It first generates a
lλ (ω) lλ (ω)
q(ω) := R = λ . (22) large number of random features by RFF and then selects a subset
lλ (ω)dω dK from them using an energy-based scheme
Compared to standard Monte Carlo sampling for RFF, leverage n
1X
score sampling requires fewer Fourier features and enjoys nice S̃(ω) = yi zp (xi , ω) .
theoretical guarantees [31], [51] (see the next section for details). n i=1
Note that q(ω) can be also defined by the integral operator [52], PJ
Further, the quantity (1/J) i=1 S̃ 2 (ωj ) can be associated with
[112] rather than the Gram matrix used above, but we do not
kernel polarization for {wi }Ji=1 sampled from p(ω). Accordingly,
strictly distinguish these two cases. The typical leverage score
the top s random features with the top |S̃(·)| values are selected as
based sampling algorithm for RFF is illustrated in [31] as below.
the refined random features. This algorithm can in fact be regarded
LS-RFF (Leverage Score-RFF) [31]: It uses a subset of data to
as a version of the kernel alignment method for generating random
approximate the matrix K in Eq. (21) so as to compute dλK . LS-
features.
RFF needs O(ns2 + s3 ) time to generate refined random features,
CLR-RFF (Compression Low Rank-RFF) [46]: It first gener-
which can be used in KRR [31] and SVM [11] for prediction.
ates a large number of random features and then selects a subset
SLS-RFF (Surrogate Leverage Score-RFF) [47]: To avoid
from them by approximately solving the optimization problem
inverting an s × s matrix in LS-RFF, SLS-RFF designs a simple
but effective surrogate leverage function 1 2
min 2
ZJ Z> J −Z eJ (a)Z eJ (a)> =
a∈RJ :kak0 ≤s n F
(25)

1 >
Lλ (w) = p(w)z> (X) yy + nI zp,w (X) ,
h i
p,w > >
n2 λ E i.i.d.
i,j ∼ [J]
ϕp (x i ) ϕ p (x j ) − ϕ
e p (x i ) ϕ
ep (xj ) ,
(23)
where the additional term nI and the coefficient 1/(n2 λ) in where ϕp (x) ∈ RJ uses J random features, and ϕ
ep (x) is
Eq. (23) ensure that Lλ is a surrogate function that upper bounds 1 >
the function lλ in Eq. (20). One then samples random features ep (x) := √ a1 exp(−iω>
ϕ 1 x), · · · , aJ exp(−iωJ x)
>
,
L (ω) J
from the surrogate distribution Q(w) = R Lλλ(ω)dω , which has
the same time complexity O(ns2 ) as RFF. SLS-RFF and can be which leads to Z eJ (a) = [ϕ ep (x1 ), ϕep (x2 ), · · · , ϕ
ep (xn )] ∈
applied to KRR [47] and Canonical Correlation Analysis [109]. Rn×J . We can construct a Monte-Carlo estimate of the opti-
Note that leverage scores sampling is a powerful tool used in mization objective function in Eq. (25) by sampling some pairs
i.i.d.
sub-sampling algorithms for approximating large kernel matrices i, j ∼ [J]. Therefore, this scheme focuses on a subset of pairs,
with theoretical guarantees, in particular in Nyström approximation. instead of the all data pairs, by seeking a sparse weight vector a
Research on this topic mainly focuses on obtaining fast leverage with only s nonzero elements. The problem of building a small,
score approximation due to inversion of an n-by-n kernel matrix, weighted subset of the data that approximates the full dataset,
e.g., two-pass sampling [113] (LS-RFF belongs to this), online is known as the Hilbert coreset construction problem. It can be
setting [114], path-following algorithm [81], or developing various approximately solved by greedy iterative geodesic ascent [116]
surrogate leverage score sampling based algorithms [47], [48], or Frank-Wolfe based methods [117]. Another way to obtain the
[109]. compact random features is using Johnson-Lindenstrauss random
projection [118] instead of the above data-dependent optimization
4.2 Re-weighted random features scheme.
Here we briefly review three re-weighted methods: KA-RFF [83]
4.3 Kernel learning by random features
by kernel alignment, KP-RFF [44] by kernel polarization, and
CLR-RFF [46] by compressed low-rank approximation. This class of approaches construct random features using sophisti-
KA-RFF (Kernel Alignment-RFF) [83]: It pre-computes a large cated learning techniques, e.g., by learning the spectral distribution
number of random features that are generated by RFF, and then of kernel from the data.
select a subset of them by solving a simple optimization problem Representative approaches in this class often involve a one-
based on kernel alignment [115]. In particular, the optimization stage or two-stage process. The two-stage scheme is common when
problem is using random features. It first learns the random features, and then
incorporates them into kernel methods for prediction. Actually, the
n J
X X above-mentioned leverage sampling and random features selection
max yi yj at zp (xi , ωt ) zp (xj , ωt ) , (24)
a∈PJ based algorithms employ this scheme. The algorithm proposed in
i,j=1 t=1
[84] is a typical method for kernel learning by random features.
where J > s is the number of the candidate random features by This method first learns a spectral distribution of a kernel via an
RFF, and a is the weight vector. Here the maximization is over the implicit generative model, and then trains a linear model by these
set of distributions PJ := {a : Df (ak1/J) ≤R c}, where c > 0 learned features.
dP
is a pre-specified constant and Df (P kQ) := f ( dQ )dQ with One-stage algorithms aim to simultaneously learn the spectral
2 2
f (t) = t − 1 is the χ -divergence between the distributions distribution of a kernel and the prediction model by solving a
P and Q (a special case of the f -divergence). Solving the single joint optimization problem or using a spectral inference
problem (24) learns a (sparse) weight vector a of the candidate scheme. For example, Yu et al. [85] propose to jointly optimize
12
  improved in [11], [31], [52] under various settings. Note that some
kk − k̃k∞ : [9], [30], [49], [50]

 
results above do not directly apply to the squared loss in KRR,

 


kk − k̃kLr : [30], [49]
 
whose Lipschitz parameter
√ is unbounded. For squared losses, Rudi


 approximation error
∆-spectral approximation: [51], [97]

et al. [53] show that Ω( n log n) random features by√ RFF suffice

 


 
 (∆ , ∆ )-spectral approximation: [45]
1 2 to achieve a minimax optimal learning rate O(1/ n). A more




empirical risk:
[45], [51] ( refined analysis is given in [31] under the p(ω)-sampling and

 
 ω ∼ p(ω): [31], [53] q(ω)-sampling settings.
 squared loss ω ∼ q(ω): [31]

 




 Below we discuss the above theoretical work in more details.
expected risk

 (

ω ∼ p(ω): [31], [119]

 
 Lipschitz continuous ω ∼ q(ω): [11], [31], [52]

 


 
5.1 Approximation error

 

Figure 3. Taxonomy of theoretical results on random features. Table 3 summarizes representative theoretical results on the
convergence rates, the upper bound of the growing diameter, and the
the nonlinear feature mapping matrix W and the linear model resulting sample complexity under different metrics. Here sample
with the hinge loss. The associated optimization problem can be complexity means the number of random features sufficient for
solved in an alternating fashion with SGD. In [86], the kernel achieving a maximum approximation error at most .
alignment approach in the Fourier domain and SVM are combined The first result of this kind is given by Rahimi and Recht
into a unified framework, which can be also solved using an [9], who use a covering number argument to derive a uniform
alternating scheme by Langevin dynamics and projection gradient convergence guarantee as follows. For a compact subset S of Rd ,
descent. Wilson and Adams [87] construct stationary kernels as the let |S| := supx,x0 ∈S kx − x0 k2 be its diameter and consider the
Fourier transform of a Gaussian mixture based on Gaussian process L∞ error kk − k̃k∞ := supx,x0 ∈S |k(x, x0 ) − k̃(x, x0 )|.
frequency functions. This approach can be extended to learning
Theorem 4. [Uniform convergence of RFF [9], [30]] Let S be
with Fastfood [88], non-stationary spectral kernel generalization
a compact subset of Rd with diameter |S|. Then, for a stationary
[70], [71], and the harmonizable mixture kernel [89]. Moreover,
kernel k and its approximated kernel k̃ obtained by RFF, we have
Oliva et al. [90] propose a nonparametric Bayesian model, in
which p(ω) is modeled as a mixture of Gaussians with a Dirichlet 2d
ςp |S| s2
h i d+2
process prior. The parameters of the Gaussian mixture and the Pr kk − k̃k∞ ≥ ≤ Cd exp − ,
classifier/regressor model are inferred using MCMC. 4(d + 2)

where ςp2 = Ep [ω> ω] = tr ∇2 k(0) ∈ O(d), and Cd :=

5 T HEORETICAL A NALYSIS 6d+2 d 2
2 d+2
In this section, we review a range of theoretical results that center
2 d+2 d + d2 d+2 satisfies Cd ≤ 256 in [9] and is
around the two questions mentioned in the introduction and restated further improved to Cd ≤ 66 in [30] by optimization balls of
below: radius in covering number.
1) Approximation: how many random features are needed to According to the above theorem by covering number, with
ensure a high quality estimator in kernel approximation? s := Ω(−2 d log(1/δ)) random features, one can ensure an
2) Generalization: how many random features are needed to uniform approximation error with probability greater than 1 − δ .
incur no loss of empirical risk and expected risk in a learning This result also applies to dot-product kernels by random Maclaurin
estimator? feature maps (see [34, Theorem 8]). The quadrature based algorithm
Figure 3 provides a taxonomy of representative work on these two [27] follows this proof framework, and achieves the same error
questions. bound with a smaller constant than RFF in Theorem 4 by an
For the approximation error, existing work focuses on kk−k̃k∞ extra boundedness assumption. Instead, Fastfood [36] on Gaussian
[9], [30], [49], kk − k̃kLr with 1 ≤ r < ∞ [49], ∆-spectral
p
kernels achieves O( log(d/δ)) times approximation error than
approximation [51], [97], and (∆1 , ∆2 )-spectral approximation RFF due to estimates for ΓHB2 in Eq. (9), which is based on
[45]. For the empirical risk under the fixed design setting, existing concentration inequalities for Lipschitz continuous functions under
work provides guarantees on the expected in-sample predication the Gaussian distribution.
error of the KRR estimator based on ∆-spectral approximation Different from the above results using Hoeffding’s inequality
bounds [51] and (∆1 , ∆2 )-spectral approximation bounds [45]. For for the covering number bound in their proof, Sriperumbudur
the expected risk, a series of works investigate the generalization and Szabó [49] revisit the above bound by refined technique of
properties of methods based on p(ω)-sampling or q(ω)-sampling. McDiarmid’s inequality, symmetrization and bound the expectation
These results cover loss functions with/without Lipschitz continuity of Rademacher average by Dudley entropy bound.
and apply to e.g. KRR [31], [53] and SVM [11], [32], [52] under
different assumptions. Theorem 5 (Theorem 1 in [49]). Under the same assumption of
More specifically, Rahimi and Recht [32] provide the earliest Theorem 4, we have
result on learning with RFF with Lipschitz continuous loss " √ #
functions. Their results imply that Ω(n) random features are h(d, |S|, σp ) + 2
Pr kk − k̃k∞ ≥ √ ≤ e− ,
sufficient to incur no loss of learning √ accuracy. This result is s
improved in [31], which shows that Ω( n log n) random features
or even less suffice for the Gaussian kernel. When using the data- where h(d, |S|, σp ) is an appropriately defined function of d,
dependent sampling {ωi }si=1 ∼ q(ω), the above results are further |S|, and σp . For better comparison, the above inequality can
13

be rewritten as [50] Theorem 5.4] present a non-asymptotic comparison result between

h i RFF and ORF for spectral approximation by virtue of the smallest
Pr kk − k̃k∞ ≥ ≤ [(σp + 1)(2|S| + 1)]1024d singular value of K + nλI .
s2

256d Theorem 8 (Theorem 5.4 in [97]). For the Gaussian kernel, let
exp − + .
2 log(2|S| + 1) ∆ f + λnIn is a ∆
e be the smallest positive number such that K e-
spectral approximation of K + λnIn , where K
f is an approximate
Theorem 5 shows that k̃ is a consistent estimator of k in the
kernel matrix obtained by RFF or ORF. Then, for any > 0 we
topology pof compact convergence as s → ∞ with the convergence
have
rate Op ( s−1 log |S|). Consequently, O(−2 log |S|) random
features suffice to achieve an approximation accuracy. This Pr[∆e > ] ≤ B ,
2
2 σmin
sample complexity bound scales logarithmically with |S|, which
improves upon the O(−2 |S|2 log(|S|/)) bound that follows where B := E[kK f − Kk2 ] and σ 2 is the smallest singular
F min
from [9], [30] (cf. Theorem 4). Apart from the L∞ error value of K + λnIn . In particular, letting B ORF denotes the value
bound, the authors of [49] further derive bounds on the Lr of B for the estimator ORF and B RFF for RFF, we have
1/r
0 0 r 0
R R
error kk − k̃kLr := S S |k(x, x ) − k̃(x, x )| dxdx
 
n 4 kxi −xj k2
for 1 ≤ r < ∞; see Table 3 for a summary. We remark that the s − 1 1 X kxi −x k
j 2 − 2
1
B RFF −B ORF =  e ς2 +O .
L2ρX error bound is also given in [30], though the rate in [49] is s 2d i,j=1 ς2 d
sometimes better in terms of the diameter.
For the Gaussian kernel, the approximation guarantee can be Theorem 8 shows that B RFF > B ORF always holds for the
further improved. In particular, the following theorem gives a Gaussian kernel. To better understand the above upper bound on
probability bound independent of d. Pr[∆e > ], we note that both V[RFF] and V[ORF] are O(1/s),
Theorem 6 (Theorem 1 in [50]). Under the same assumption of hence B = O(n2 /s). Moreover, since the Gaussian kernel has
Theorem 4, for the Gaussian kernel k and its approximation k̃ by exponentially decaying eigenvalues (see Assumption 4), we have
2
RFF, we have σmin = Ω(n2 λ2 ). Therefore, the upper bound of Pr[∆ e > ] is on
2/3 the order of O( sλ1 2 ). With the standard scaling of the regularization
|S| s2 parameter λ = n−α , α ∈ (0, 1], we need s := Ω(n2α ) to get a

h i 3
Pr kk − k̃k∞ ≥ ≤ 1/3 exp − .
s 12 non-trivial upper bound on the probability. When α = 1/2, these
results for RFF and ORF require Ω(n) random Fourier features,
Avron et al. [51] argue that the above point-wise distances
which is somewhat unsatisfactory [31].
kk − k̃k∞ or kk − k̃kLr are not sufficient to accurately measure
The results in Theorem 7 can be improved if we consider data-
the approximation quality. Instead, they focus on the following
dependent sampling, i.e., {ωi }si=1 are sampled from the empirical
spectral approximation criterion.
ridge leverage score distribution q(ω) = lλ (ω)/dλK in Eq. (22)
Definition 1. [∆-spectral approximation [51]] For 0 ≤ ∆ < 1, instead of the standard p(ω).
a symmetric matrix A is a ∆-spectral approximation of another
symmetric matrix B , if (1 − ∆)B A (1 + ∆)B , where Theorem 9 (Lemma 6 in [51]). Let k be a shift-invariant kernel
A B indicates that B − A is a semi-positive definite matrix. associated with the empirical ridge leverage score distribution q(ω)
in Eq. (22), ∆ ≤ 1/2 and δ ∈ (0, 1). Assume that kKk2 ≥ λ and
According to this definition, ZZ> + λIn is ∆-spectral {ωi }si=1 ∼ q(ω). If the total number of random features satisfies
approximation of K + λIn if
8
(1 − ∆) (K + λIn ) ZZ> + λIn (1 + ∆) (K + λIn ) . s ≥ ∆−2 dλK log 16dλK /δ ,
3
The follow theorem gives the number of random features s that then
are sufficient to guarantee ∆-spectral approximation. h i
Pr (1 − ∆) (K + λIn ) ZZ> + λIn (1 + ∆) (K + λIn )
Theorem 7 (Theorem 7 in [51]). Let k be a shift-invariant kernel
−3s∆2

and its associated probability distribution p(ω) (i.e., the Fourier λ
≥ 1 − 16dK exp ≥ 1−δ.
transform of k ), ∆ ≤ 1/2, δ ∈ (0, 1), and nλ := n/λ. Assume 8dλK
that kKk2 ≥ λ and {ωi }si=1 ∼ p(ω). If the total number of Theorem 9 shows that if we sample using the ridge leverage
random features satisfies function, then Ω(dλK log dλK ) random features, which is less than
8 Ω(nλ log dλK ), suffice for spectral approximation of K .
s ≥ ∆−2 nλ log 16dλK /δ ,
3 The authors of [45] generalize the notion of ∆-spectral
then approximation to (∆1 , ∆2 )-spectral approximation.
h i
Pr (1−∆) (K +λIn ) ZZ> +λIn (1 + ∆) (K + λIn ) Definition 2 ((∆1 , ∆2 )-spectral approximation [45]). For
∆1 ,∆2 ≥ 0, a symmetric matrix A is a (∆1 , ∆2 )-
−3s∆2

λ
≥ 1 − 16dK exp ≥ 1−δ. spectral approximation of another symmetric matrix B , if
8nλ (1 − ∆1 )B A (1 + ∆2 )B .
Theorem 7 states that Ω(nλ log dλK ) random features are This definition is motivation by the argument that the quantities
sufficient to guarantee ∆-spectral approximation by the matrix ∆1 and ∆2 in the upper and lower bounds may have different
Bernstein concentration inequality and effective degree of freedom, impact on the generalization performance. Using this definition,
where nλ := n/λ. Under this framework, Choromanski et al. [97, Zhang et al. [45] derive the following approximation guarantees
14

Table 3
Comparison of convergence rates and required random features for kernel approximation error.

Metric Results Convergence rate Upper bound of |S| Required random features s
q q
|S|
Theorem 4 ( [9], [30]) Op |S| log s
s
|S| ≤ Ω s
log s s ≥ Ω d−2 log
q
log |S|
kk − k̃k∞ |S| ≤ Ω(sc )1 s ≥ Ω d−2 log |S|

Theorem 1 in [49] Op s
q
log |S|
|S| ≤ Ω(sc ) s ≥ Ω −2 log |S|

Theorem 1 in [50] (Gaussian kernels) Op s
q
2d log |S|
r
s
s ≥ Ω d−2 log |S|

kk − k̃kLr (1 ≤ r < ∞) Corollary 2 in [49] Op |S| r
s |S| ≤ Ω ( log s)
4d

q
2d r
1
s ≥ Ω d−2 log |S|

kk − k̃kLr (2 ≤ r < ∞) Theorem 3 in [49] Op |S| r
s |S| ≤ Ω s 4d
q
nλ
Theorem 7 in [51] Op s - s ≥ Ω(nλ log dλ
K)
∆-spectral approximation
Theorem 5.4 in [97] (Gaussian kernels) ORFF/ORF 1
sλ2
- s ≥ Ω(n2α )
r !
dλ
Lemma 6 in [51] Oq K
s - s ≥ Ω(dλ λ
K log dK )

q
nλ
(∆1 , ∆2 )-spectral approximation Theorem 2 in [45] OLP s
2
- s ≥ Ω(nλ log dλ
K)

1
c is some constant satisfying 0 < c < 1.
2
LP denotes that {ωi }si=1 are obtained by RFF and then are quantized to a Low-Precision b-bit representation; see [45].

when one quantizes each random Fourier feature ωi to a low- functions with/without Lipschitz continuity and for learning tasks
precision b-bit representation, which allows more features to be including KRR [31], [53] and SVM [11], [32], [52]. Apart from
stored in the same amount of space. supervised learning with random features, results on randomized
nonlinear component analysis refer to [10], random features with
Theorem 10 (Theorem 2 in [45]). Let K f be an s-features b-bit
matrix sketching [120], doubly stochastic gradients scheme [94],
LP-RFF approximation of a kernel matrix K and δ ∈ (0, 1).
statistical consistency [121], [122].
Assume that kKk2 ≥ λ ≥ δb2 = 2/(2b − 1)2 and define a :=
δ2
8 Tr(K + λIn )−1 (K + δb2 In ). For ∆1 ≤ 3/2 and ∆2 ∈ [ λb , 32 ], 5.2.1 Assumptions
if the total number of random features satisfies
Before we detail these theoretical results, we summarize the
8 2 2 a
standard assumptions imposed in existing work. Some assumptions
s ≥ nλ max , 2 log ,
3 ∆1 ∆2 − δb /λ δ are technical, and thus familiarity with statistical learning theory
then (see Section 2.1) would be helpful. We organize these assumptions
h i in four categories as shown in Figure 4, including i) the existence
Pr (1−∆1 ) (K +λIn ) K̃ +λIn (1 + ∆2 ) (K +λIn ) of fρ (Assumption 1) and its stronger version (Assumption 8);

−3s∆21
ii) quality of random features (Assumptions 2, 6, 7); iii)
≥ 1 − a exp noise conditions (Assumptions 3, 9, 10); iv) eigenvalue decay
4nλ (1+2/3∆1 )
(Assumptions 4, 5).
−3s(∆2 −δb2 /λ)2

+ exp . We first state three basic assumptions, which are needed in all
4nλ (1 + 2/3(∆2 −δb2 /λ)) of the (regression) results to be presented.
Theorem 10 shows that when the quantization noise is small Assumption 1 (Existence [53], [123]). In regression task, we
relative to the regularization parameter, using low precision has assume fρ ∈ H.
minimal impact on the number of features required for the
(∆1 , ∆2 )-spectral approximation. In particular, as s → ∞, ∆1 Note that since we consider a potentially infinite dimensional
converges to zero for any precision b, whereas ∆2 converges to a RKHS H, possibly universal [124], the existence of the target
value upper bounded by δb2 /λ. If δb2 /λ ∆2 , using b-bit precision function fρ is not automatic. However, if we restrict to a bounded
has negligible effect on the number of features required to attain subspace of H, i.e., HR = {f ∈ H : kf k ≤ R} with R < ∞
this ∆2 see Table 3 for a summary. fixed a prior, then a minimizer of the risk E(f ) always exists as
long as HR is not universal. If fρ exists, then it must lie in a ball
5.2 Risk and generalization property of some radius Rρ,H . The results in this section do not require
The above results on approximation error are a means to an end. prior knowledge of Rρ,H and they hold for any finite radius.
More directly related to the learning performance is understanding Assumption 2 (Random features are bounded and continuous [53]).
generalization properties of random features based algorithms. To For the shift-invariant kernel k , we assume that ϕ(ω> x) in Eq. (6)
this end, a series of work study the generalization properties of is continuous in both variables and bounded, i.e., there exists κ ≥ 1
algorithms based on p(ω)-sampling and q(ω)-sampling. Under such that |ϕ(ω> x)| < κ for all x ∈ X and ω ∈ Rd .
different assumptions, theoretical results have been obtained for loss
15



 i) regression: fρ ∈ H (Ass. 1)(⇐ source condition (Ass. 8)

bounded and continuous (Ass. 2)


ii) quality of random features


q(ω)-sampling: compatibility condition (Ass. 6) ⇐ optimized distribution (Ass. 7)




 (
regression: boundedness on y (Ass. 3)
iii) noise condition
classification: Massart’s low noise condition (Ass. 9) ⇐ separation condition (Ass. 10)





 (
exponential decay


iv) eigenvalue decays assumption (Ass. 4)


polynomial decay and harmonic decay ⇔ capacity condition (Ass. 5)




Figure 4. Relationship between the needed assumptions. The notation A ⇐ B means that B is a stronger assumption than A.

k(x, ·) due to the self-adjoint property of the Hilbert spaces L2ρX and
X H H [122]. With s random features, the inclusion operator I can
be approximated by the operator A : H e → L2 , (Aβ) =
ρX
ϕ(·) I s
hϕ(·), βiHe , ∀β ∈ R . Figure 5 presents the relationship between
various spaces under different operators.
H
e L2ρX The integral operator Σ plays a significant role in characterizing
A
the hypothesis space. In particular, the decay rate of the spectrum
of Σ quantifies the capacity of the hypothesis space in which
Figure 5. Maps between various spaces.
we search for the solution. This capacity in turn determines the
number of random features required for accurate learning. Rudi
and Rosasco [53] consider the following assumption on Σ.
condition [124], [125]). For any x ∈
Assumption 3 (Bernstein’s
X , we assume E |y|b | x ≤ 12 b!ς 2 B b−2 when b ≥ 2 .

Assumption 5 (Capacity condition [123], [126]). There exist Q >
This noise condition is weaker than the boundedness on y . It 0 and γ ∈ [0, 1] such that for any λ > 0, we have
is satisfied when y is bounded, sub-Gaussian, or sub-exponential. N (λ) := tr (Σ + λI)−1 Σ ≤ Q2 λ−γ .

(26)
In particular, if y ∈ [− 2b , 2b ] almost surely with b > 0, then
Assumption 3 is satisfied with ς = B = b. The effective dimension N (λ) [110] measures the “size” of
The above three assumptions are needed in all theoretical the RKHS, and is in fact the operator form of dλK in Eq. (21).
results for regression presented in this section, so we omit them Assumption 5 holds if the eigenvalues λi of Σ decay as i−1/γ ,
when stating these results. We next introduce several additional which corresponds to the eigenvalue decay of K in Assumption 4
assumptions, which are needed in some of the theoretical results. with γ := 1/(2a) [127]. The case γ = 0 is the more benign
Eigenvalue Decay Assumptions: The following assumption, situation, whereas γ = 1 is the worst case.
which characterizes the “size” of the RKHS H of interest, is often Quality of Random Features: Here we introduce several
discussed in learning theory. technical assumptions on the quality of random features. The
leverage score in Eq. (20) admits the operator form
Assumption 4 (Eigenvalue decays [111]). A kernel matrix 2
K admit the following three types of eigenvalue decays: 1) F∞ (λ) := sup (Σ + λI)−1/2 ϕ(x) , ∀λ > 0 ,
L2ρ
Geometric/exponential decay: λi (K) ∝ n exp(−i1/c ), which ω X

leads to dλK . log(R0 /λ); 2) Polynomial decay: λi (K) ∝ which is also called as the maximum random features dimension
ni−2a , which implies dλK . (1/λ)1/2a ; 3) Harmonic decay: [53]. By defintion we always have N (λ) ≤ F∞ (λ). Roughly
λi (K) ∝ n/i, which results in dλK . (1/λ). speaking, when the random features are “good”, it is easy to
control their leverage scores in terms of the decay of the spectrum
We give some remarks on the above assumption. For shift-
of Σ. Further, fast learning rates using fewer random features can
invariant kernels, if the RKHS is small, the eigenvalues of the
be achieved if the features are compatible with the data distribution
kernel matrix K often admit a fast decay. Consequently, functions
in the following sense.
in the RKHS are smooth enough that a good prediction performance
can be achieved. On the other hand, if the RKHS is large and the Assumption 6 (Compatibility condition [53]). With the above
eigenvalues decay slowly, then functions in the RKHS are not definition of F∞ (λ), assume that there exist % ∈ [0, 1], and F > 0
smooth, which would lead to a sub-optimal error rate for prediction. such that F∞ (λ) ≤ F λ−% , ∀λ > 0.
It can be linked to the integral operator [123], [124] characterizing
It always holds that F∞ (λ) ≤ κ2 λ−1 when z is uniformly
the hypothesis space, defined as Σ : L2ρX → L2ρX such that
bounded by κ. So the worst case is % = 1, which means that the
random features are sampled in a problem independent way. The
Z
(Σg)(x) = k(x, x0 )g(x0 )dρX (x0 ), ∀g ∈ L2ρX . favorable case is % = γ , which means that N (λ) ≤ F∞ (λ) ≤
X
O(n−αγ ). In [11], the authors consider the following assumption.
It is clear that the operator Σ is self-adjoint, positive definite,
and trace-class when k(·, ·) is continuous. This operator can be Assumption 7 (Optimized distribution [11]). The feature mapping
represented as Σ = II∗ in terms of the inclusion operator I : z(ω, x) is called optimized if there is a small constant λ0 such
P∞ λi (Σ)
H → L2ρX , (If ) = f . Here I∗ is the adjoint of I and is given by that for any λ ≤ λ0 , F∞ (λ) ≤ N (λ) = i=1 λi (Σ)+λ .
Z Under the previous definitions, Assumption 7 holds only
I∗ : L2ρX → H, (I∗ f )(·) = k(x, ·)f (x)dρX , when F∞ (λ) = N (λ). This assumption is stronger than the
X
16
√
compatibility condition in Assumption 6. Note that Assumption 7 assumptions and appropriately chosen parameters, Ω( n log n)
is satisfied when sampling from q(ω). random features suffice for KRR to achieve minimax optimal rates.
Source condition on fρ : The following assumption states that
Theorem 11 (Generalization bound; Theorem 3 in [53]). Suppose
fρ has some desirable regularity properties.
that Assumption 8 (source condition) holds with r ∈ [ 12 , 1], As-
Assumption 8 (Source condition [53], [128]). There exist 1/2 ≤ sumption 6 (compatibility) holds with % ∈ [0, 1], and Assumption 5
r ≤ 1 and g ∈ L2ρX such that fρ (x) = (Σr g)(x) almost surely. (capacity) holds with γ ∈ [0, 1]. Assume that n ≥ n0 and choose
1

Since Σ is a compact positive operator on L2ρX , its r-th power λ := n 2r+γ . If the number of random features satisfies
r
Σ is well defined for any r > 0.6 Assumption 8 imposes a form of α+(2r−1)(1+γ−α) 108κ2
regularity/sparsity of fρ , which requires the expansion of fρ on the s ≥ c0 n 2r+γ log ,
λδ
basis given by the integral operator Σ. Note that this assumption
is more stringent than the existence of fρ in H. The latter is then the excess risk of f˜z,λ can be upper bounded as
equivalent to Assumption 8 with r = 12 (the worst case), in which
case fρ ∈ H need not have much regularity/sparsity.
2 18 − 2r+γ
2r
E f˜z,λ − E (fρ ) = f˜z,λ − fρ ≤ c1 log2 n ,
Noise Condition: The following two assumptions on noise are L2ρ
X
δ
considered in random features for classification.
where c0 , c1 are constants independent of (n, λ, δ), and n0 does
Assumption 9 (Massart’s low noise condition [11]). There exists not depends on n, λ, fρ , or ρ.
V ≥ 2 such that
Theorem 11 unifies several results in [53] that impose different
E(x,y)∼ρ [y|x] ≥ 2/V . assumptions. The simplest result is Theorem 1 in [53], which only
requires the three basic Assumptions 1–3 on existence, boundedness
Assumption 10 (Separation condition [11]). The points in X can
and continuity, corresponding to the the worst case of Theorem 11
be collected into two sets according to their labels as follows
with % = γ = 1 and √ r = 1/2. In this case, by choosing λ =
X1 := {x ∈ X : E[y|x] > 0} , n−1/2 , we require Ω( n log n) random features to achieve the
X−1 := {x ∈ X : E[y|x] < 0} . minimax convergence rate O(n−1/2 ); also see Table 4.
A more refined result is given in Theorem 2 in [53], which
For i ∈ {±1}, the distance of a point x ∈ Xi to the set X−i accounts for the capacity of the RKHS and the regularity of fρ ,
is denoted by ∆(x). We say that the data distribution satisfies a as quantified by the parameters γ ∈ [0, 1] (Assumption 5) and
separation condition if there exists ∆ > 0 such that ρX (∆(x) < r ∈ [ 21 , 1] (Assumption 8), respectively. Under these conditions
c) = 0. 1 1+γ(2r−1)
and choosing λ := n− 2r+γ , we require Ω n 2r+γ log n

2r
The above two assumptions, both controlling the noise level random features to achieve the convergence rate O n− 2r+γ .
in the labels, can be cast under into a unified framework [131] as Note that γ = 1 is the worst case, where the eigenvalues of
follows. Define the regression function η(x) = E[y|X = x] in K have the slowest decay, and γ = 1/(2a) ∈ (0, 1) means that
binary classification problems. The Massart’s low noise condition the eigenvalues follow a polynomial decay λi ∝ ni−2a . Table 4
means that there exists h ∈ (0, 1] such that for |η(x)| ≥ h for all presents this result with γ := 1/(2a) for better comparison with
x ∈ X . Here h characterizes the level of noise in classification the other results.
problems. If small h is small, then η(x) is close to zero, in The above two results apply to the standard RFF setting with
which case correct classification is difficult. Massart’s condition data-independent sampling. When {ωi }si=1 are sampled from a
can be extended to the following more flexible condition known as data-dependent distribution satisfying the compatibility condition
Tsybakov’s low noise assumption [131]. This assumption stipulates in Assumption 6 with % ∈ [0, 1], then Theorem 3 in [53] provide
that there exists a constant C > 0 such that for all sufficiently 1
an improved result. In this case, by choosing λ := n− 2r+γ , we
small t > 0, we have %+(1+γ−%)(2r−1)
require Ω n 2r+γ log n random features to achieve the
Pr {x ∈ X : |2η(x) − 1| ≤ t} ≤ C · tq ,
2r
convergence rate O n− 2r+γ .
for some q > 0. The separation condition in Assumption 10 is an If the compatibility condition is replaced by the stronger
extreme case of the Tsybakov’s noise assumption with q = ∞. It is Assumption 7 (optimized distribution), satisfied by q(ω)-sampling,
clear that noise-free distributions satisfy this separation assumption, the work [31] derives an improved bound that is the sharpest to
since the conditional probability η is bounded away from 1/2. date. Below we state a general result from [31] that covers both
p(ω)- and q(ω)-sampling.
5.2.2 Squared loss in KRR
Theorem 12 (Theorem 1 in [31]). Suppose that the regularization
In this section, we review theoretical results on the generalization parameter λ satisfies 0 ≤ nλ ≤ λ1 . We consider two sampling
properties of KRR with squared loss and random features, for schemes.
both the p(ω)-sampling (data-independent) and q(ω)-sampling
(data-dependent) settings. Table 4 summarizes these results for the
• {ωi }si=1 ∼ p(ω): if s ≥ (5z02 /λ) log(16dλK /δ) and
excess risk in terms of the key assumptions imposed, the learning
|z(ω, x)| ≤ z0 ,
s λ λ

• {ωi }i=1 ∼ q(ω): if s ≥ 5dK log 16dK /δ ,
rates, and the required number of random features.
We begin with the remarkable result by Rudi and Rosasco then for 0 < δ < 1, with probability 1 − δ , the excess risk of f˜z,λ
[53]. They are among the first to show that under some mild can be upper bounded as

6. A more general condition (r > 0) is often considered in approximation

2 √
theory; see [129], [130].
fez,λ − fρ ≤ 2λ + O(1/ n) + E fz,λ − E (fρ ) , (27)
L2ρ
X
17

Table 4
Comparison of learning rates and required random features for expected risk with the squared loss function.

sampling scheme Results key assumptions eigenvalue decays λ learning rates required s
1 1 √
[53, Theorem 1] - - n− 2 Op n− 2 s ≥ Ω( n log n)

− 2t − 4rt
i−2t n 1+4rt Op n 1+4rt s ≥ Ω( 2t+2r−1
1+4rt log n)
source condition
[53, Theorem 2]

− 1 − 2r 2r
1/i n 2r+1 Op n 2r+1 s ≥ Ω(n 2r+1 log n)
{ωi }si=1 ∼ p(ω)
1 1 √
e− c i
1 n− 2 Op n− 2 s ≥ Ω( n log log n)
1 1 √
[31, Corollary 2] - i−2t n− 2 Op n− 2 s ≥ Ω( n log n)
1 1 √
1/i n− 2 Op n− 2 s ≥ Ω( n log n)

− 2t − 4rt %+(2r−1)(2t+1−2t%)
i−2t n 1+4rt Oq n 1+4rt s ≥ Ω( 1+4rt log n)
source condition;
[53, Theorem 3] compatibility condition

− 1 − 2r 2r
1/i n 2r+1 Oq n 2r+1 s≥ Ω(n 2r+1 log n)

1 1
{ωi }si=1 ∼ q(ω)

e− c i
1 n− 2 Oq n− 2 s ≥ Ω(log2 n)
1 1
[31, Corollary 1] optimized distribution i−2t n− 2 Oq n− 2 s ≥ Ω(n1/(2t) log n)
1 1 √
1/i n− 2 Oq n− 2 s ≥ Ω( n log n)

where we recall that E fz,λ −E (fρ ) is the excess risk of standard 5.2.3 Lipschitz continuous loss function
KRR with an exact kernel (see Section 2). In this section, we consider loss functions ` that are Lipschitz
continuous. Examples include the hinge loss in SVM and the
Remark: A sharper convergence rate can be achieved if the
cross-entropy loss in kernel logistic regression. Table 5 summarizes
Rademacher complexity used in [31] is substituted by the local
several existing results for such loss functions in terms of the
Rademacher complexity [132], see [133] for details.
learning rate and the required number of random features. We
For p(ω)-sampling, Theorem 12 improves on the results of
briefly discuss these results below and refer the readers to the cited
[53] under the exponential and polynomial decays. Specifically,
work for the precise theorem statements.
if {ωi }si=1 ∼ p(ω), Theorem 12 requires s ∝ 1/λ log dλK .
If {ωi }si=1 ∼ p(ω), i.e., under the standard RFF setting with
Specialized
√ to the exponential decay case, this result requires
data-independent sampling, we have the following results.
Ω( n log log n) random features to achieve an O(n−1/2 ) learn-
• Theorem 1 in [32] shows that the excess risk converges at a
ing√rate, which is an improvement compared to [53] with
Ω( n log n) random features. certain O(n−1/2 ) rate with Ω(n log n) random features.
• Corollary 4 in [31] shows that with λ ∈ O(1/n) and
For q(ω)-sampling, Theorem 12 shows that if λ = n−1/2 ,
Ω (1/λ) log dλK random features, the excess risk of f˜z,λ

then s ∝ dλK log dλK random features is sufficient to incurs no
can be upper bounded by
loss in the expected risk if KRR, with a minimax learning rate √
√
O(n−1/2 ). Corollaries of this result under three different regimes E(f˜z,λ ) − E (fρ ) ≤ O 1/ n + O( λ) .
of eigenvalue decay are summarized in Table 4. √
Carratino et al. [134] extend the result of [53] to the setting The above bound scales with λ, which is different from
where KRR is solved by stochastic gradient descent (SGD). They the bound in Eq. (27) for the squared loss. Therefore, for
show that under the basic Lipschitiz continuous loss functions, we need to choose a
√ Assumptions 1–3 and some mild smaller regularization parameter λ ∈ O(1/n) to achieve the
conditions for SGD, Ω( n) random features suffice to achieve
the minimax learning rate O(n−1/2 ). This result matches those same O(n−1/2 ) convergence rate. Also note that as before
for standard KRR with an exact kernel [135]. The above results we can bound dλK under the three types of eigenvalue decay.
can be improved if in addition the source condition in Assumption If {ωi }si=1 ∼ q(ω), i.e., under the data-dependent sampling
1+α(2r−1)
8 holds, in which case Ω(n 2r+α ) random features suffice to setting, we have the following results.
2r
achieve an O(n− 2r+α ) learning rate. • For SVM with random features, under the optimized dis-

The work in [136] shows that if the randomized feature map is tribution in Assumption 7 and the low noise condition in
bounded (which is weaker than Assumption 2), then we have the Assumption 9, Theorem 1 in [11] provides bounds on the
following out-of-sample bound learning rates and the required number of random features.
This result is improved in [11, Theorem 2] if we consider the
stronger separation condition in Assumption 10. Details can
1
E(fg
z,λ ) − E(fz,λ ) ≤ O . be found in Table 5.
sλ • In Section 4.5 in [52] and Corollary 3 in [31], it is shown that
if Assumption 7 holds, then the excess risk of f˜z,λ converges
If we choose λ := n−1/2 , then Ω(n) random features are sufficient at an O(n−1/2 ) rate with Ω(dλK log dλK ) random features, if
to ensure an O(n−1/2 ) rate in the out-of-sample bound. we choose λ ∈ O(1/n).
18

Table 5
Comparison of learning rates and required random features for expected risk with a Lipschitz continuous loss function.

sampling scheme Results key assumptions eigenvalue decay λ learning rates required s
1
[32, Theorem 1] - - - Op n− 2 s ≥ Ω(n log n)
1
e− c i
1 1
n Op n− 2 s ≥ Ω(n log log n)
{ωi }si=1 ∼ p(ω)
1
[31, Corollary 4] - i−2t 1
n Op n− 2 s ≥ Ω(n log n)
1
1/i 1
n Op n− 2 s ≥ Ω(n log n)
c+2
1 1
s ≥ Ω(logc n log logc n)
1

e− c i n Oq n log n
low noise condition

− t − t 1
[11, Theorem 1] i−2t n 1+t Oq n 1+t log n s ≥ Ω(n 1+t log n)
optimized distribution
1
1/i 1
n Oq n− 2 s ≥ Ω(n log n)
{ωi }si=1 ∼ q(ω)
separation condition 1 2
e− c i
1
log2c+1 n log log n s ≥ Ω(log2c n log log n)

[11, Theorem 2]
optimized distribution n−2c Oq n
1
e− c i
1 1
n Oq n− 2 s ≥ Ω(log2 n)
[52, Section 4.5] 1
[31, Corollary 3]
optimized distribution i−2t 1
n Oq n− 2 s ≥ Ω(n1/(2t) log n)
1
1/i 1
n Oq n− 2 s ≥ Ω(n log n)

There is an abnormal but common experiment phenomenon Table 6

on kernel approximation and risk generalization, that is, a higher Dataset statistics.
kernel approximation quality does not always translate to better
generalization performance, see the discussion in [27], [45], [51]. datasets d #traing #test random split scaling
Understanding this inconsistency between approximation quality ijcnn1 22 49,990 91,701 no -
and generalization performance is an important open problem in EEG 14 7,490 7,490 yes mapstd
this topic. Here we present a preliminary result for KRR: a better cod-RNA 8 59,535 157,413 no mapstd
covtype 54 290,506 290,506 yes minmax
approximation quality cannot guarantee a lower generalization risk, magic04 10 9,510 9,510 yes minmax
see Proposition 1 as below, with proof deferred to Appendix A. letter 16 12,000 6,000 no minmax
skin 3 122,529 122,529 yes minmax
Proposition 1. Given the target function fρ and the original kernel a8a 123 22,696 9,865 no -
matrix K , consider two random features based algorithms A1 and MNIST 784 60,000 10,000 no minmax
A2 yielding two approximated kernel matrices K f1 and K f2 , and CIFAR-10 3072 50,000 10,000 no -
(A1) (A2)
their respective KRR estimators f˜z,λ and f˜z,λ . Then for a new MNIST-8M 784 8,100,000 10,000 no -
sample x, even if kK − K f1 k ≤ kK − K f2 k holds in some norm
metric, there exists one case for the excess risk such that matrix. Ghashami et al. [120] combine random features with matrix
(A1) (A2) sketching for KPCA. For finding the top-` principal components,
E[f˜z,λ (x)] − E[fρ (x)] ≥ E[f˜z,λ (x)] − E[fρ (x)] .
they improve the time and space complexities to O(nsd+n`s) and
Remark: Our proof is geometric by constructing a counter- O(sd + `s), respectively. Xie et al. [94] propose to use the doubly
example. It requires that the kernel admits (at least) polynomial stochastic gradients scheme to accelerate KPCA. The authors of
decay, which holds for the common-used Gaussian kernel and [121] investigate the statistical consistency of KPCA with random
could be further relaxed for the existence of the proof. features. They show that the top-` eigenspace of the empirical
covariance matrix in H e converges to the covariance operator in
√ √
5.3 Results for nonlinear component analysis H at the rate of O(1/ n + 1/ s). Therefore, √ Ω(n) random
features are required to guarantee a O(1/ n) rate. Ullah et al.
In addition to supervised learning problems such as classification
[122] instead pose KPCA as a stochastic optimization problem
and regression, random features can also be used in unsupervised
and show that the empirical risk minimizer (ERM) in the √ random
learning, e.g., randomized nonlinear component analysis. Here we
feature
√ space converges in objective value at an O(1/ n) with
give an overview of the results for this problem.
Ω(` n log n) random features.
The authors of [10] propose to use random features to
approximate the kernel matrix in kernel Principal Component
Analysis (KPCA) and kernel Canonical Correlation Analysis 6 E XPERIMENTS
(KCCA). They show that the approximate kernel matrixpconverges In this section, we empirically evaluate the kernel approximation
to the true one in operator norm at a rate of O(n log n/s). and classification performance of representative random features
More precisely, s = O((log n)2 /2 ) suffices to ensure that algorithms on several benchmark datasets. All experiments are
kKf − Kk2 ≤ n with the probability 1 − 1/n. Their algorithm implemented in MATLAB and carried out on a PC with Intelr
takes O(ns2 + nsd) time to construct feature functions and i7-8700K CPU (3.70 GHz) and 64 GB RAM. The source code of
O(s2 + sd) space to store the feature functions and covariance our implementation can be found in http://www.lfhsgre.org.
19

6.1 Experimental settings including the number of feature dimension, training samples,
Kernel: We choose the popular Gaussian kernel, zero/first-order test data, training/test split, and the normalization scheme.
arc-cosine kernels, and polynomial kernels for experiments. These eight non-image benchmark datasets can be downloaded
i) Gaussian kernel: from https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/ or
the UCI Machine Learning Repository7 . Some datasets include a
kx − x0 k22

0
k(x, x ) = exp − , (28) training/test partition, denoted as “no” in the random split column.
2ς 2 For the other datasets, we randomly pick half of the data for training
where the kernel width parameter ς is tuned via 5-fold inner cross and the rest for testing, denoted as “yes” in the random split column.
validation over a grid of {0.01, 0.1, 1, 10, 100}. There are two typical normalization schemes used in these datasets:
To evaluate the Gaussian kernel, we conduct the following “mapstd” and “minmax”. The “mapstd” scheme sets each sample’s
representative algorithms for comparison: RFF [9], ORF [24], mean to 0 and deviation to 1, while the “minmax” scheme is a
SORF [24], ROM [80], Fastfood [36], QMC [37], SSF [43], standard min-max scaling operation mapping the samples to the
GQ [26], and LS-RFF [31]. These algorithms include both data- bounded set [0, 1]d . Two representative image datasets are the
independent and data-dependent approaches and involve a variety of MNIST handwritten digits dataset [137] and the CIFAR10 natural
techniques including Monte Carlo and quasi-Monte Carlo sampling, image classification dataset [138], summarized in the last two rows
quadrature rules, variance reduction, and computational speedup in Table 6. The MNIST dataset contains 60,000 training samples
using structural/circulant matrices. and 10,000 test samples, each of which is a 28 × 28 gray-scale
ii) arc-cosine kernels: Different from Gaussian kernels and image of a handwritten digit from 0 to 9. Here the “minmax”
polynomial kernels, the designed arc-cosine kernels [60] can be normalization scheme means that each pixel value is divided by
closely connected to neural networks, which include feature spaces 255. The CIFAR10 dataset consists of 60,000 color images of
that mimic the sparse, nonnegative, distributed representations of size 32 × 32 × 3 in 10 categories, with 50,000 for training and
single-layer threshold networks. The used zeroth order kernel is 10,000 for test. Besides, apart from medium/large scale datasets in
given explicitly by our experiments, we also evaluate the compared approaches on a
ultra-large scale dataset MNIST 8M [139], which is derived from
θ
k(x, x0 ) = 1 − , the MNIST dataset by random deformations and translations. It
π shares the same number of feature dimension and test data with the
which corresponds to the Heaviside step function σ(ω > x) = MNIST dataset, but has 8,100,000 training data.
1 >
2 (1 + sign(ω x)) in Eq. (6). The first order kernel is Evaluation metrics: We evaluate the performance of all the
1 compared algorithms in terms of approximation error, time
k(x, x0 ) = kxk2 kx0 k2 (sin θ + (π − θ) cos θ) , cost, and test accuracy. We use kK − Kk f F /kKkF as the
π
error metric for kernel approximation. A small error indicates
which corresponds to the ReLU activation function σ(ω > x) = a high approximation quality. To compute the approximation
max{0, ω>x} in Eq. (6). error, we randomly sample 1,000 data points to construct the
Here we consider the zero/first-order arc-cosine kernel and sub-feature matrix and the sub-kernel matrix. We record the
compare these ten algorithms (used for Gaussian kernel approx- time cost of each algorithm on generating feature mappings. The
imation) as well. Note that, the theoretical foundation behind kernel width ς in the Gaussian kernel is tuned by five-fold cross
random features, Bocher’s theorem, is invalid to arc-cosine kernels. validation over the grid {0.01, 0.1, 1, 10, 100}. The regularization
Thankfully, according to the formulation of arc-cosine kernels parameter λ in ridge linear regression and the balance parameter in
admitting in Eq. (6), the Monte Carlo sampling (e.g., RFF) is liblinear are tuned via 5-fold inner cross validation on a grid
able to used for arc-cosine kernel approximation. In this case, the of {10−8 , 10−6 , 10−4 , 10−3 , 10−2 , 0.05, 0.1, 0.5, 1, 5, 10} and
remaining algorithms, e.g., ORF, QMC, and Fastfood, on various {0.01, 0.1, 1, 10, 100}, respectively. For the sake of computational
sampling strategies, can be still applicable to arc-cosine kernels, at efficiency, we conduct a relatively coarse hyper-parameter tuning.
least in the algorithmic aspect. Nevertheless, a refined hyper-parameter search might result in better
iii) Polynomial kernel: This is a widely used family of dot classification performance. The random features dimension s in our
product kernels given by experiments takes value in {2d, 4d, 8d, 16d, 32d}. All experiments
k(x, x0 ) = (1 + hx, x0 i)b , are repeated 10 times and we report the average approximation
error, average classification accuracy with their respective standard
where b is the order. In our experiments, the order is set to b = 2. deviations as well as the time cost for generating random features.
Note that, different from Gaussian kernels and arc-cosine kernels,
polynomial kernels admit neither the Bochner’s theorem nor the 6.2 Results for the Gaussian Kernel
sampling formulation in Eq. (6), so classical random features based 6.2.1 Results on non-image benchmark datasets
algorithms are applicable to arc-cosine kernels but still invalid to Here we test various random features based algorithms, including
polynomial kernels even though both of them are dot-product. As RFF [9], ORF [24], SORF [24], ROM [80], Fastfood [36], QMC
a result, algorithms for polynomial kernel approximation are often [37], SSF [43], GQ [26], LS-RFF [31] for kernel approximation and
totally different. In this survey, we include three representative then combine these algorithms with lr/liblinear for classification
approaches for evaluation, including Random Maclaurin (RM) [34], on eight non-image benchmark datasets, refer to Appendix B.1 for
Tensor Sketch (TS) [73], and Tensorized Random Projection (TRP) details. Here we summarize the best performing algorithm on each
[74]. dataset in terms of the approximation quality and classification
Datasets: We consider eight non-image benchmark datasets, two accuracy in Table 7, where we distinguish the small s case (i.e.,
representative image datasets, and a ultra-large scale dataset
for evaluation. Table 6 gives an overview of these datasets 7. https://archive.ics.uci.edu/ml/datasets.html.
20

Table 7
Results statistics on several classification datasets. The best algorithm on each dataset is given in two cases: low dimensional (i.e., s = 2d, 4d) and
high dimensional (i.e., s = 16d, 32d) according to approximation quality, test accuracy in linear regression or liblinear. The notation “-” means that
there is no statistically significant difference in the performance of most algorithms.

approximation lr liblinear
datasets
small s large s small s large s small s large s
ijcnn1 SSF SORF, QMC, ORF - - Fastfood -
EEG SSF ORF - - - -
cod-RNA SSF - - - - -
covtype ORF - - - - -
magic04 SSF SSF, ORF, QMC, ROM - - - -
letter SSF SSF, ORF - - - -
skin SSF, ROM QMC - - - -
a8a - - - - SSF -

adequate in SSF, additional points (i.e., a larger s) may have a small

marginal benefit in variance reduction under the large s setting.
Consequently, the approximation error of SSF sometimes stays
almost the same with a larger number of random features. QMC and
ORF seek for variance reduction on random features. Nevertheless,
they often work well in the large s case. As demonstrated by
the expression for variance of ORF [24] and convergence rate in
QMC [37], this theoretical result is consistent with the numerical
performance of ORF and QMC, which may explain the reason why
they work better in a large s setting than a small s case.
Results on arc-cosine kernels and polynomial kernels can be
in Appendix B. Besides, apart from the above used medium/large
scale datasets in our experiments, we also evaluate the compared
approaches on a ultra-large scale dataset MNIST 8M [139] with
millions of data. Due to the memory limit, following the doubly
stochastic framework [38], we incorporate these random features
based approaches under the data streaming setting for the reduction
of time and space complexity.

6.2.2 Classification results on MNIST and CIFAR10

Here we consider the MNIST and CIFAR10 datasets, on which
we test these random features based algorithms for kernel
approximation and then combine these algorithms with liblinear
for image classification. In our experiment, we use the Gaussian
kernel8 , whose kernel width ς is tuned by 5-fold cross validation
over the grid ς = [0.01, 0.1, 1, 10, 100]. For the MNIST database,
(a) MNIST (b) CIFAR10
we directly use the original 784-dimensional feature as the data.
Figure 6. Approximation error, time cost, and test accuracy of various For better performance on the CIFAR10 dataset, we use VGG16
algorithms with liblinear on two image classification datasets. with batch normalization [141] pre-trained on ImageNet [142] as a
feature extractor. We fine-tune this model on the CIFAR10 dataset
s = 2d or s = 4d) and the large s case (i.e., s = 16d or s = with 240 epochs and a mini-batch size 64. The learning rate is
32d). The notation “-” therein means that there is no statistically initialized at 0.1 and then divided by 10 at the 120-th, 160-th, and
significant difference in the performance of most algorithms. 200-th epochs. For each color image, a 4096 dimensional feature
In terms of approximation error, we find that SSF, ORF, and vector is obtained from the output of the first fully-connected layer
QMC achieve promising approximation performance in most in this fine-tuned neural network.
cases. Recall that the goal of using random features is to find Figure 6(a) shows the approximation error, the time cost (sec.),
a finite-dimensional (embedding) Hilbert space to approximate the and the classification accuracy by liblinear across a range of
original infinite-dimensional RKHS so as to preserve the inner s = 1000 to s = 10, 000 random features on the MNIST database.
product. To achieve this goal, SSF, QMC, and ORF are based on We find that ORF and SSF yield the best approximation quality.
a similar principle, namely, generating random features that are Despite that most algorithms achieve different approximation
as independent/complete as possible to reduce the randomness in errors, there is no significant difference in the test accuracy, which
sampling. Regarding to SSF, we find that SSF performs well under
8. As indicated by [13], [140], (convolutional) NTK generally performs
the small s case, but the significant improvement does not hold better than Gaussian kernel but it is still non-trivial to obtain a efficient random
for the large s case. This might be because, a few points can be features mapping for (convolutional) NTK without much loss on prediction.
21

compared to general arbitrary deep networks. This is a potentially

fruitful research direction, and one hand, the optimization and
generalization of such model have been studied in [22], [150] in
deep learning theory. On the other hand, in order to explain the
double descent curve of random features in over-parameterized
regimes, we often work in a high dimensional setting, which is
more subtle than classical results in standard settings, as indicated
by recent random matrix theory (RMT) [151], [152], [153]. An
intuitive example [154] is, kK − ZZ> kF → 0 always hold in
low/high dimensions as s → ∞ but kK − ZZ> k2 → 0 does not
(a) sonar (low-dimensioanl) (b) MNIST (high-dimensional) hold for n, d, s → ∞. Accordingly, in this section, we provide
an overview on analysis of (high dimensional) random features
Figure 7. Training error, test error, and approximation error of random in over-parameterized setting, especially on double descent. We
features regression with λ = 10−8 on the sonar dataset with n =
208, d = 60 and the sub-set of MNIST (class 1 versus class 2) with remark upfront that the random features model on double descent is
n = 200, d = 784. not the only way for analyzing DNNs. Many other approaches, with
different points of views, have been proposed for deep learning
corresponds to the results on non-image datasets. Similar results are theory, but they are out of scope of this survey.
observed on the CIFAR10 dataset with s = 5000 to s = 12, 000
random features; see Figure 6(b). Note that most algorithms take 7.1 Results on High Dimensional Random Features in
the similar time cost on generating random features except for Over-parameterized Setting
the data-dependent algorithm LS-RFF. Several structured based Here we briefly introduce the problem setting of high dimensional
approaches (e.g., Fastfood, SORF, ROM) do not achieve significant random features in over-parameterized regimes, and then discuss
reduction on time cost due to the relatively inefficient Matlab the techniques used in various studies.
built-in function to implement the Walsh-Hadamard transform. In the basic setting, high dimensional random features often
work with least squares regression setting in an asymptotic
viewpoint, i.e., n, d, s → ∞ with d/n → ψ1 ∈ (0, ∞) and
7 T RENDS : H IGH - DIMENSIONAL R ANDOM
s/n → ψ2 ∈ (0, ∞), in which overparameterization corresponds
F EATURES IN OVER - PARAMETERIZED SETTINGS to ψ2 ≥ 1. The considered data generation model in the basic
In the previous sections, we review random features based setting is quite simple. To be specific, the training data is collected
algorithms and their theoretical results, that works under a fixed d in a matrix X ∈ Rn×d , the rows
√ of which are assumed to be drawn
setting with s n. Random features based approaches are simple i.i.d from N (0, 1) or Sd−1 ( d). The labels are given by a linear
in formulation but enjoy nice empirical validations and theoretical ground truth corrupted by some independent additive Gaussian
guarantees in kernel approximation and generalization properties. noise: yi = fρ (xi ) + εi , where fρ (x) = hx, ζi for a fixed but
Recently, analysis of over-parameterized models [22], [143], [144], unknown ζ and εi ∼ N (0, 1). The transformation matrix under
[145], [146] has attracted a lot of attention in learning theory, this setting is often taken as the random Gaussian matrix with the
partly due to the observation of several intriguing phenomena, ReLU activitation function (recall Eq. (6)). Current approaches
including capability of fitting random labels, strong generalization employ various data generation schemes and assumptions to obtain
performance of overfitted classifiers [19] and double descent in the a refined analysis beyond double descent under the basic setting.
test error curve [20], [147]. Moreover, Belkin et al. [20], [148] point According to these criteria, we summarize the problem setting of
out that the above phenomena are not unique to deep networks but various representative approaches in Table 8. In the next, we briefly
also exist in random features and random forests. In Figure 7, we review the conceptual and technical contributions of underlying
report the empirical training error, the test error, and the kernel approaches on high dimensional random features.
approximation error of random features regression as a function Belkin et al. [162] begin with an one-dimensional (noise-
of s/n on the sonar dataset and the MNIST dataset [137]. Even free) version of the random features model, and provide an
with n, d, s only in the hundreds, we can still observe that as s asymptotic analysis to explain the double descent phenomenon.
increases, the training error reduces to zero and the approximation The subsequent work focuses on the standard random features
error monotonously decreases. However, the test error exhibits model under different settings and assumptions. It is clear that,
double descent, i.e., a phase transition at the interpolation threshold: the presence of the nonlinear activation function σ(·) makes the
moving away from this threshold on both sides trends to reduce the random features model intractable to study the related (limiting)
generalization error. This is somewhat striking as it goes against the spectral distribution. Accordingly, the key issue in this topic mainly
conventional wisdom on bias-variance trade-off [149]: predictors focuses on studying random matrices with nonlinear dependencies,
that generalize well should trade off the model complexity against e.g., how to disentangle the nonlinear function σ(·) by Gaussian
training data fitting. equivalence conjecture. Hastie et al. [143] consider the basic setting
The above observations have motivated researchers to build on endowed by a bounded activation function with a standardization
the elegant theory of random features to provide an analysis of condition, i.e., E[σ(t)] = 0 and E[σ(t)2 ] = 1 for t ∼ N (0, 1).
neural networks in the over-parameterized regime. To be specific, By establishing asymptotic results on resolvents of random block
RFF can be regarded as a two-layer (large-width) neural network, matrices from RMT, the limiting of the variance is theoretically
where the weights in the first layer are chosen randomly/fixed demonstrated to be increasing for ψ2 ∈ (0, 1), decreasing for
and only the output layer is optimized. This is a typical over- ψ2 ∈ (1, ∞), and diverging as ψ2 → 1.
parameterized model if we take s n. As such, two-layer neural In a similar spirit, Mei and Montanari [144] use RMT
networks in this regime are more amenable to theoretical analysis as to study the spectral distribution of the Gram matrix Z =
22

Table 8
Comparison of problem settings on analysis of high dimensional random features on double descent.

data generation
studies metric asymptotic? result
{xi }n
i=1 fρ activation function W
[143, Theorem 7] population risk N (0, Id ) hx, ζi normalized N (0, 1/d) 3 variance % &
[155, Theorem 4] population risk N (0, Id ) hx, ζi bounded N (0, 1/d) 3 variance % &
√ √
[144, Theorem 2] expected excess risk Sd−1 ( d) hx, ζi + nonlinear 1
bounded Unif(Sd−1 ( d)) 3 variance, bias % &

[156] expected excess risk N (0, Id ) hx, ζi ReLU N (0, 1) 3 refined 2

[157] generalization error N (0, Id ) f (hx, ζi) general general 3 %&

[158, Theorem 1] generalization error N (0, Id ) hx, ζi normalized N (0, 1) 3 refined 2

[159, Theorem 1] generalization error N (0, Id ) hx, ζi general general 3 %&
[160, Proposition 1] generalization error N (0, Id ) hx, ζi odd, bounded sub-Gaussian 3 %&
[161, Theorem 5.1] expected excess risk Gaussian general [cos(·), sin(·)] N (0, 1) 7 %&

[154, Theorem 3] generalization error general -3 [cos(·), sin(·)] N (0, 1) 3 %&

√
1
The nonlinear component is a centered isotropic Gaussian process indexed by x ∈ Sd−1 ( d).
2
A refined decomposition on variance is conducted by sources of randomness: “noise variance”, “initialization variance”, and “sampling variance” to
possess each term [156] or their interpretations [158].
3
It makes no assumptions on fρ but requires that test data “behave” statistically like the training data by concentrated random vectors.

√ √
σ(XW> / d)/ d by considering the Stieltjes transform of a “calculus of deterministic equivalents” for random matrices [170].
related random block matrix, and show that, under least squares This technique is also used in [154] to derive the exact asymptotic
regression setting in an asymptotic viewpoint, both the bias and deterministic equivalent of EW [(ZZ> +nλI)−1 ], which captures
variance have a peak at the interpolation threshold ψ2 = 1 and the asymptotic behavior on double descent. Note that, this work
diverge there when λ → 0. Under this framework, according to the makes no data assumption to match real-world data, which is
randomness stemming from label noise, initialization, and training different from previous work relying on specific data distribution.
features, a refined bias-variance decomposition is conducted by
[156], [163] and further improved by [158], [164] using the analysis 7.2 Discussion on Random Features and DNNs
of variance. Apart from refined error decomposition schemes, the
As mentioned, random features models have been fruitfully used to
authors of [155], [157], [159] consider a general setting on convex
analyze the double descent phenomenon. However, it is non-trivial
loss functions, transformation matrix, and activation functions for
to transfer results for these models to practical neural networks,
regression and classification. Here the techniques used for analysis
which are typically deep but not too wide. There is still a substantial
are not limited to RMT. Instead, replica method [165] (a non-
gap between existing theory based on random features and the
rigorous heuristic method from statistical physics) used in [156],
modern practice of DNNs in approximation ability. For example,
[157], [163] and the convex Gaussian min-max (CGMM) theorem
under the spherical data setting, Ghorbani et al. [67] (a more general
[166] used in [159] are two alternative way to derive the desired
version in [171] on data distribution) point out that as n → ∞,
results. Note that, CGMM requires the data to be Gaussian, which
a random features regression model can only fit the projection
might restrict the application scope of their results but is still
of the target function onto the space of degree-` polynomials
a common-used technical tool for max-margin linear classifier
when s = Ω(d`+1−δ ) random features are used for some δ > 0.
[167], boosting classifiers [168], and adversarial training for linear
More importantly, if s, d are taken as large with s = Ω(d), then the
regression [169] in over-parameterized regimes. Admittedly, the
function space by random features can only capture linear functions.
applied replica method in statistical physics is quite different from
Even if we consider the NTK model, it can just capture quadratic
[144] for tackling inverse random matrices in RMT. However,
functions. That means, both random features and NTK have limited
most of the above methods admit the equivalence between the
approximation power in the lazy training scheme [65]. In addition,
considered data model and the Gaussian covariate model. That
Yehudai and Shamir [172] show that the random features model
means, problem (3) with Gaussian data can be asymptotically
cannot efficiently approximate a single ReLU neuron as it requires
equivalent to
the number of random features to be exponentially large in the
n
1X feature dimension d. This is consistent with the classical result
min ` yi , β > (µ0 1k + µ1 W xi + µ? ti ) + λkβk22 , for kernel approximation in the under-parameterized regime: the
β∈Rs n
i=1
random features model, QMC, and quadrature based methods
where {ti }n i=1 ∼ N (0, Id ), µ0 = E[σ(t)], µ1 = E[tσ(t)] and require s = Ω(exp(d)) to achieve an approximation error [26].
µ? = E[σ(t)2 ] − µ20 − µ21 for a standard Gaussian variable t. This Admittedly, the above results may appear pessimistic due to
equivalence on generalization error in an asymptotic viewpoint is the simple architecture. Nevertheless, random features is still an
proved in [160]. effective tool, at least the first step, for analyzing and understanding
Different from the above results in an asymptotic view, DNNs in certain regimes, and we believe its potential has yet to
Jacot et al. [161] present a non-asymptotic result by taking finite- be fully exploited. Note that the random features model is still
size Stieltjes transform of generalized Wishart matrix, and further a strong and universal approximator [173] in the sense that the
argue that random feature models can be close to KRR with an RKHSs induced by a broad class of random features are dense in
additional regularization. The used technical tool is related to the the space of continuous functions. While the aforementioned results
23

show that the number of required features may be exponential in the • There exist significant gaps between the random features
worst case, a more refined analysis can still provide useful insights model and practical neural networks, both in theory and
for DNNs. One potential way forward in deep learning theory is empirically. Even for fitting simple quadratic or mixture
to use the random features model to analyze DNNs with pruning. models, the random features model cannot achieve a zero
For example, the best paper [174] in the Seventh International error with a finite number of neurons in general, while NTK
Conference on Learning Representations (ICLR2019) put forward and fully trained networks can [180]. Numerical experiments
the following Lottery Ticket Hypothesis: a deep neural network indicate that the prediction performance of NTK and CNTK
with random initialization contains a small sub-network which, may significantly degrade if the random features are generated
when trained in isolation, can compete with the performance of from practically sized nets [13].
the original one. Malach et al. [54] provide a stronger claim that • Despite the limitations of existing theory, random features
a randomly-initialized and sufficiently over-parameterized neural models are still useful for understanding and improving DNNs.
network contains a sub-network with nearly the same accuracy as For example, understanding the equivalence between the
the original one, without any further training. Their analysis points random features model and weight pruning in the Lottery
to the equivalence between random features and the sub-network Ticket Hypothesis [54], may be promising future directions.
model. As such, the random features model is potentially useful We hope that this survey will stimulate further research on the
for network pruning [175] in terms of, e.g., guiding the design of above open problems.
neurons pruning for accelerating computations, and understanding
network pruning and the full DNNs. ACKNOWLEDGEMENTS
The research leading to these results has received funding from
the European Research Council under the European Union’s
8 C ONCLUSION Horizon 2020 research and innovation program / ERC Advanced
Grant E-DUALITY (787960). This paper reflects only the authors’
In this survey, we systematically review random features based
views and the Union is not liable for any use that may be made
algorithms and their associated theoretical results. We also give an
of the contained information. This work was supported in part
overview on generalization properties of high dimensional random
by Research Council KU Leuven: Optimization frameworks for
features in over-parameterized regimes on double descent, and
deep kernel machines C14/18/068; Flemish Government: FWO
discuss the limitations and potential of random features in the theory
projects: GOA4917N (Deep Restricted Kernel Machines: Methods
development for neural networks. Below we provide additional
and Foundations), PhD/Postdoc grant. This research received
remarks and discuss several open problems that are of interest for
funding from the Flemish Government (AI Research Program).
future research.
This work was supported in part by Ford KU Leuven Research
• As a typical data independent method, random features are Alliance Project KUL0076 (Stability analysis and performance
simpler to implement, easy to parallelize, and naturally apply improvement of deep reinforcement learning algorithms), EU
to streaming or dynamic data. Current efforts on Nyström H2020 ICT-48 Network TAILOR (Foundations of Trustworthy AI
approximation by a preconditioned gradient solver parallelized - Integrating Reasoning, Learning and Optimization), Leuven.AI
with multiple GPUs [176] and quantum algorithms [112] Institute; and in part by the National Natural Science Foundation
can guide us to design powerful implementation for random of China 61977046, in part by National Science Foundation
features to handling millions/billions data. grants CCF-1657420 and CCF-1704828, and in part by SJTU
• Experimental comparisons show that better kernel approxima- Global Strategic Partnership Fund (2020 SJTU-CORNELL) and
tion does not directly translate to lower generalization errors. Shanghai Municipal Science and Technology Major Project
We partly answer this question in the current survey but it (2021SHZDZX0102).
may be not sufficient to explain this phenomenon. We believe
this issue deserves further in-depth study.
• Kernel learning via the spectral density is a popular direction
[87], [89], which can be naturally combined with Generative
Adversarial Networks (GANs); see [84] for details. In
this setting, one may associate the learned model with an
implicit probability density that is flexible to characterize the
relationships and similarities in the data. This is an interesting
area for further research.
• The double descent phenomenon has been observed and
studied in random features model by various technical tools
under different settings. Current theoretical results, such as
those in [144], [154], may be extended to a more general
setting with less restricted assumptions on data generation,
model formulation, and the target function. Besides, more
refined analysis and delicate phenomena beyond double
descent have been investigated on the linear model, e.g.,
multiple descent phenomena [177] and optimal (negative)
regularization [178], [179]. Understanding these more delicate
phenomena for random features requires further investigation
and refined analysis.
24

A PPENDIX A
P ROOF OF P ROPOSITION 1
Proof. It is clear that an exact KRR estimator is fz,λ (x) = k(x, X)(K + nλI)−1 y and its random features based version is
f˜z,λ (x) = k̃(x, X)(K f + nλI)−1 y , where K f = ZZ> with Z ∈ Rn×s . The definition of the excess risk for least squares implies
h i
E(f˜z,λ ) − E(fρ ) = E(f˜z,λ ) − E(fz,λ ) + [E(fz,λ ) − E(fρ )] = kf˜z,λ − fz,λ k2 + kfz,λ − fρ k2 ,

where the first term in the right hand is the expected error difference between the original KRR and its random features approximation
version. The second term in the right hand is the excess risk of KRR, which is independent of the quality of kernel approximation.
Specifically, the first term can be further expressed by the representer theorem
n h i 2
!
kf˜z,λ − fz,λ k = Ex [f˜z,λ (x) − fz,λ (x)] = Ex
X
2 2
α̃i k̃(xi , x) − αi k(xi , x) . (29)
i=1

Intuitively speaking, kernel approximation aims to preserve the inner product in two Hilbert spaces, i.e., hk(x, ·), k(x0 , ·)iH ≈
hk̃(x, ·), k̃(x0 , ·)iHe . Nevertheless, the preservation of the inner-product does not immediately guarantee a small value of
α̃i hk̃(x, ·), k̃(x0 , ·)iHe − αi hk(x, ·), k(x0 , ·)iH in Eq. (29).
Informally, the proof idea is the following: define K f = K + E and k̃(x, X) = k(x, X) + ˜ with the residual error matrix
n×n 1×n
E ∈R and the residual error vector ˜ ∈ R such that k̃(x, X) ∈ R1×n . Generally, the residual error E and ˜ show the
consistency, that is, a small kernel approximation error kEk implies a small k˜ k. Consider two random features based algorithms A1 and
A2 yielding two approximated kernel matrices K f1 and K f2 , and their respective KRR estimators f˜(A1) and f˜(A2) . The corresponding
z,λ z,λ
residual error matrices/vectors are defined as (E1 , ˜1 ) and (E2 , ˜2 ) such that K f1 := K + E1 and K f2 := K + E2 . Without loss
of generality, we assume kE1 k ≤ kE2 k and k˜ 1 k ≤ k˜2 k. In this case, our target is to prove that, there exists one case such that
(A1) (A2)
|f˜z,λ (x) − fz,λ (x)| ≥ |f˜z,λ (x) − fz,λ (x)|. For notational simplicity, denote T (E, ˜) := hy> , k(x, X)(K + nλI)−1 E − ˜i,
T1 (E1 , ˜1 ) := hy> , k(x, X)(K + nλI)−1 E1 − ˜1 i, and T2 (E2 , ˜2 ) := hy> , k(x, X)(K + nλI)−1 E2 − ˜2 i. To prove our result,
we make the following three assumptions:
• I. the residual matrix E is semi-positive definite and K f1 , Kf2 are non-singular.
• II. nλ ≤ λ1 (K f1 ) ≤ λ1 (Kf2 ), and K
f2 admits (at least) polynomial decay.
• III. the inner product hy> , k(x, X)(K + nλI)−1 E − ˜i =: T (E, ˜) is non-negative.
The above three assumptions are mild, common-used and attainable, see in [31]. Specifically, we only need to prove the existence of
(A1) (A2)
our claim: there exists one case such that |f˜z,λ (x) − fz,λ (x)| ≥ |f˜z,λ (x) − fz,λ (x)| under kE1 k ≤ kE2 k and k˜ 1 k ≤ k˜
2 k.
Therefore, the above assumptions could be further relaxed.
According to Eq. (29), for a new sample x, we in turn focus on |f˜z,λ (x) − fz,λ (x)|, which can be upper bounded by

|f˜z,λ (x) − fz,λ (x)| = |k(x, X)(K + nλI)−1 y − [k(x, X) + ˜](K + E + nλI)−1 y|
= |k(x, X)[(K + nλI)−1 − (K + nλI + E)−1 ]y − ˜(K + nλI + E)−1 y|
= |[k(x, X)(K + nλI)−1 E − ˜](K + nλI + E)−1 y|
n
X 1 (30)
≤ [k(x, X)(K + nλI)−1 E − ˜]y
λ
i=1 i
(K + E) + nλ
n n
X 1 X T (E, ˜)
= hy> , (k(x, X)(K + nλI)−1 E − ˜i =:
i=1
λi (K + E) + nλ i=1
λi (K + E) + nλ

where the third equality holds by A−1 −B −1 = A−1 (B−A)B −1 . The first inequality derives from a> Ab ≤ a> b tr(A) for two semi-
positive definite matrix A and b> a (which can be derived from the used assumptions). Further, by virtue of a> Ab ≥ λn (A) tr(a> b),
the error |f˜z,λ (x) − fz,λ (x)| can be lower bounded by

|f˜z,λ (x) − fz,λ (x)| = |[k(x, X)(K + nλI)−1 E − ˜](K + nλI + E)−1 y|
hy> , [k(x, X)(K + nλI)−1 E − ˜]i T (E, ˜) (31)
≥ =:
λ1 (K + E) + nλ λ1 (K + E) + nλ
Combining Eqs. (30) and (31), we have
n
T (E, ˜) T (E, ˜)
≤ |f˜z,λ (x) − fz,λ (x)| ≤
X
0≤ . (32)
λ1 (K + E) + nλ λ (K + E) + nλ
i=1 i

Considering such two algorithms A1 and A2, under the condition of kE1 k ≤ kE2 k and k˜
1 k ≤ k˜
2 k, there exists one case such
that T1 (E1 , ˜1 ) ≥ T2 (E2 , ˜2 ), i.e.,

hy> , (k(x, X)(K + nλI)−1 E1 − ˜1 i ≥ hy> , (k(x, X)(K + nλI)−1 E2 − ˜2 i , (33)
25

k(x, X)(K + nλI)−1 E2

k(x, X)(K + nλI)−1 E1

˜1
˜2
y
T1 (E1 , ˜1 ) T2 (E2 , ˜2 )

Figure 8. Illustration of the geometric proof for one case such that T1 (E1 , ˜1 ) ≥ cT2 (E2 , ˜2 ) under the condition of kE1 k ≤ kE2 k and k˜
1 k ≤ k˜
2 k,
where c is some constant.

which can be achieved by a geometry explanation in Figure 8. By virtue of Eq. (33) and Assumption II, we have
T1 (E1 , ˜1 ) T2 (E2 , ˜2 )
− e ≥ 0.
=: C
λ1 (K + E1 ) + nλ λ1 (K + E2 ) + nλ
The above inequality implies
n n
T2 (E2 , ˜2 ) (A1) (A2) T1 (E1 , ˜1 )
≤ f˜z,λ (x) − fz,λ (x) − f˜z,λ (x) − fz,λ (x) ≤ C
X X
e−
C e+ .
i=2
λi (K + E2 ) + nλ i=2
λi (K + E1 ) + nλ
The left-hand of the above inequality can be further improved as
n
(A1) (A2) T2 (E2 , ˜2 )
f˜z,λ (x) − fz,λ (x) − f˜z,λ (x) − fz,λ (x) ≥ C
X
e−
i=2
λi (K + E2 ) + nλ
n
T1 (E1 , ˜1 ) X T2 (E2 , ˜2 )
≥ −
λ1 (K + E2 ) + nλ i=1 λi (K + E2 ) + nλ
n
T1 (E1 , ˜1 ) T2 (E2 , ˜2 ) X λi (K + E2 )
≥ −
λ1 (K + E2 ) + nλ λn (K + E2 ) i=1 λi (K + E2 ) + nλ
T1 (E1 , ˜1 ) T2 (E2 , ˜2 ) λ
≥ − df ,
2λ1 (K
f2 ) f2 ) K
λn (K 2

where dλf is the “effective dimension” of Kf2 defined in Eq. (21) and the last inequality follows from Assumption II.
K2
(A1) (A2)
According to the above result, f˜ (x) − fz,λ (x) − f˜
z,λ (x) − fz,λ (x) ≥ 0 holds by the following condition
z,λ
" #
λ1 (K
f2 )
T1 (E1 , ˜1 ) ≥ 2 dλf T2 (E2 , ˜2 ) .
f2 ) K
λn (K 2

| {z }
=O(1)

We observe that, an invertible matrix K

f2 admits a finite condition number λ1 (K f2 )/λn (Kf2 ) < ∞. Besides, a fast polynomial eigenvalue
decay of K f2 ensures the effective dimension dλ to be finite, which can be obtained by Assumption 5 with γ = 0. Accordingly, in
Kf2
this case, when these condition are satisfied, T1 (E1 , ˜1 ) ≥ cT2 (E2 , ˜2 ) can be achieved for some constant c, which can be intuitively
observed by Figure 8. Finally, we conclude the proof for existence.

A PPENDIX B
E XPERIMENTS
In this section, we detail the experimental settings and present the comparison results on the compared approaches on several benchmark
datasets across various kernels. This part is organized as follows.
• In Section B.1, we present experimental results across the Gaussian kernel on eight non-image datasets in terms of approximation
error, the time cost (sec.) for generating random features mappings, classification accuracy by linear regression and liblinear.
26

(a) letter

(b) ijcnn1

(d) cod-RNA

Figure 9. Results of various algorithms across the Gaussian kernel on the letter, ijcnn1, covtype, cod-RNA datasets.

• Results on approximation error and test accuracy (by linear regression) across arc-cosine kernels and polynomial kernels are
presented in Sections B.2 and B.3, respectively.
• In Section B.4, a ultra-large scale dataset is applied to further validate the related algorithms.

B.1 Results on Gaussian kernels

Figures 9, 10 show the approximation error for the Gaussian kernel, the time cost (sec.) of generating randomized feature mappings, and
the test accuracy yielded by linear regression and liblinear on the eight datasets, respectively. We see that as the number of random
features increases, these algorithms achieve a smaller approximation error and a higher classification accuracy for both classifiers. We
notice some interesting phenomena in terms of the relation between approximation quality and prediction performance, depending on
whether the feature dimension is low (i.e., s = 2d or s = 4d) or high (i.e., s = 16d or s = 32d). In particular, the algorithms with the
best kernel approximation performance are often different in the low-dimensional case and the high dimensional case. Therefore, no
algorithm always dominate the others. On the other hand, while the approximation quality of these algorithms varies, their prediction
performance are often similar. Further, to better understand the above observations, we summarize the best performing algorithm on each
dataset in terms of the approximation quality and classification accuracy in Table 7, as illustrated in our main text (refer to Section 6.2.1).
Regarding to computational efficiency, most algorithms achieve the similar time cost on generating random features except SSF
and LS-RFF. SSF requires constructing the transformation matrix by minimizing the discrete Riesz 0-energy in advance; LS-RFF is a
27

(a) EEG

(b) magic04

(d) a8a

Figure 10. Results of various algorithms across the Gaussian kernel on the EEG, magic04, skin, a8a datasets.

data-dependent algorithm that needs to calculate the approximated ridge leverage score. Nevertheless, Fastfood/SORF/ROM does not
achieve the reduction on time cost, which appears contradictory to the underlying theoretical result on time complexity. This might be
because, one hand, the feature dimension of the used datasets in our experiments often ranges from 10 to 100, except for the image
datasets. In this case, it appears difficult to observe the computational saving from O(sd) to O(s log d) or O(d log d). On the other
hand, in our experiments, due to the relatively inefficient Matlab implementation of Fast Discrete Walsh-Hadamard Transform, typical
algorithms (e.g., Fastfood/SORF/ROM) do not show a significant reduction on computational efficiency than RFF.

B.2 Results on Arc-cosine kernels

As mentioned before, according to Eq. (6), various algorithms based on different sampling strategies can be still applicable to arc-cosine
kernels, e.g., ORF, QMC, and Fastfood. Accordingly, eight representative algorithms are taken into comparison on arc-cosine kernels,
including RFF, ORF, SORF, ROM, Fastfood, QMC, SSF, and GQ.
Figures 11, 12 show the approximation error and test accuracy across the zero/first-order arc-cosine kernels, respectively. It can
be observed that in most cases SSF and QMC achieve a lower approximation error than the other approaches, which corresponds to
the theoretical findings. However, there is no distinct difference on approximation between RFF and ORF/SORF. In fact, the current
theoretical results on ORF/SORF for variance reduction are only valid to the Gaussian kernel. Whether such results can be transferred
to arc-cosine kernels are still unclear. In general, the approximation performance and time cost (see Figure 14(a) and 14(b)) of these
algorithms on arc-cosine kernels are similar to that on the Gaussian kernel, though the approximation error value is often larger than
28

(a) letter (b) ijcnn1

(c) covtype (d) cod-RNA

(e) EEG (f) magic04

(g) skin (h) a8a

Figure 11. Results on eight datasets across the zero-order arc-cosine kernel.

that for the Gaussian kernel. This is because, according to Eq. (6), we actually conduct a d-dimensional integration approximation,
the smoothness of the integrand σ(ω > x)σ(ω> x0 ) would significantly effect the approximation performance as indicated by sampling
theory. In the term of classification performance, the difference in test accuracy of most algorithms is relatively small, which shows the
similar tendency with that of the Gaussian kernel.

B.3 Results on Polynomial kernels

For polynomial kernel approximation, we include three representative approaches, tensorized random projections (TRP) [74], TensorSketch
(TS) [73], and random Maclaurin (RM) [34] sketch evaluated on eight datasets for approximation and prediction. Since the polynomial
kernel can be written as a special type of tensor product, TS and TRP work in this setting by sketching a tensor product of arbitrary
vectors, which is different from RM using Maclaurin expansion. Figure 13 shows that, TS and TRP have the similar test accuracy, but
significantly perform better than RM, as RM’s generality is not required for the polynomial kernel. Besides, Figure 14(c) shows that RM
is quite computational efficient due to its Maclaurin expansion scheme; while TS takes much time on generating random features since it
utilizes a fixed sampling probability to compute the tensor sketch; while TRP works in a flexible sampling strategy proportional to its
Maclaurin coefficient.

B.4 Results on the MNIST-8M dataset

Here we evaluate the compared ten algorithms across the Gaussian kernel and arc-cosine kernels on the MNIST-8M dataset [139]. Due
to the memory limit, following the doubly stochastic framework [38], we incorporate these random features based approaches under
the data streaming setting for the reduction of time and space complexity. The experimental setting on this dataset follows with [38]:
the feature dimension d = 784 is reduced to 100 by PCA; the number of random features s is set to 4096; the used Gaussian RBF
29

(a) letter (b) ijcnn1

(c) covtype (d) cod-RNA

(e) EEG (f) magic04

(g) skin (h) a8a

Figure 12. Results on eight datasets across the first-order arc-cosine kernel.

kernel with kernel bandwidth ς equaling to four times the median pairwise distance; logistic regression with the regularization parameter
λ = 0.0005 for this multi-class classification task; the batch size is set to be 220 and feature block to be 215 . Besides, we report the total
time cost of each algorithm on generating feature mapping, training process and test process for evaluation.
Table 9 reports the approximation error, training error, test error, and the total time cost of each algorithm across the Gaussian kernel
and the zero/first-order arc-cosine kernels under s = 4096. It can be found that, ORF/SORF and SSF achieve the best approximation
performance on the Gaussian kernel, but ORF fails to significantly improve the approximation ability on arc-cosine kernels. This is
consistent with previous discussion on medium datasets in Section B.2.

R EFERENCES
[1] Bernhard Schölkopf and Alexander J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT Press, 2003.
[2] Johan A.K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle, Least Squares Support Vector Machines, World Scientific,
2002.
[3] Mehran Kafai and Kave Eshghi, “CROification: accurate kernel classification with the efficiency of sparse linear SVM,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 41, no. 1, pp. 34–48, 2019.
[4] Cho-Jui Hsieh, Si Si, and Inderjit Dhillon, “A divide-and-conquer solver for kernel support vector machines,” in International Conference on Machine
Learning, 2014, pp. 566–574.
[5] Yuchen Zhang, John Duchi, and Martin Wainwright, “Divide and conquer kernel ridge regression,” in Conference on Learning Theory, 2013, pp. 592–617.
[6] Fanghui Liu, Xiaolin Huang, Chen Gong, Jie Yang, and Li Li, “Learning data-adaptive non-parametric kernels,” Journal of Machine Learning Research,
vol. 21, no. 208, pp. 1–39, 2020.
[7] Alex J. Smola and Bernhard Schölkopf, “Sparse greedy matrix approximation for machine learning,” in International Conference on Machine Learning,
2000, pp. 911–918.
[8] Christopher K.I. Williams and Matthias Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing
Systems, 2001, pp. 682–688.
30

(a) letter (b) ijcnn1

(c) covtype (d) cod-RNA

(e) EEG (f) magic04

(g) skin (h) a8a

Figure 13. Results on eight datasets across polynomial kernels.

[9] Ali Rahimi and Benjamin Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, 2007, pp.
1177–1184.
[10] David Lopez-Paz, Suvrit Sra, Alex J. Smola, Zoubin Ghahramani, and Bernhard Schölkopf, “Randomized nonlinear component analysis,” in International
Conference on Machine Learning, 2014, pp. 1359–1367.
[11] Yitong Sun, Anna Gilbert, and Ambuj Tewari, “But how does it work in theory? Linear SVM with random features,” in Advances in Neural Information
Processing Systems, 2018, pp. 3383–3392.
[12] Arthur Jacot, Franck Gabriel, and Clément Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in Advances in Neural
Information Processing Systems, 2018, pp. 8571–8580.
[13] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang, “On exact computation with an infinitely wide neural net,”
in Advances in Neural Information Processing Systems, 2019, pp. 8139–8148.
[14] Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin, “Scaling neural tangent kernels via sketching and random features,”
arXiv preprint arXiv:2106.07880, 2021.
[15] Simon S Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu, “Graph neural tangent kernel: Fusing graph neural
networks with graph kernels,” in Advances in Neural Information Processing Systems, 2019, pp. 1–11.
[16] Daniele Zambon, Cesare Alippi, and Lorenzo Livi, “Graph random neural features for distance-preserving graph representations,” in International
Conference on Machine Learning. PMLR, 2020, pp. 10968–10977.
[17] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin,
Lukasz Kaiser, and Weller Adrian, “Rethinking attention with performers,” in International Conference on Learning Representations, 2021.
[18] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong, “Random feature attention,” in International Conference on
Learning Representations, 2021, pp. 1–19.
[19] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv
preprint arXiv:1611.03530, 2016.
[20] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,”
the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
[21] Yuan Cao and Quanquan Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” in Advances in Neural Information
Processing Systems, 2019, pp. 10835–10845.
31

(a) zero-order arc-cosine kernel (b) first-order arc-cosine kernel (c) polynomial kernel

Figure 14. Comparison of various algorithms on the covtype dataset in terms of time cost for generating random features mappings.

Table 9
Comparison results of various algorithms across three kernels in terms of training error (%), classification error (%) and total time
cost (sec.) on the ultra-large MNIST 8M dataset.

kernel metric RFF QMC ORF SORF ROM Fastfood SSF GQ LS-RFF
approximation error 0.0126 0.0065 0.0041 0.0041 0.0046 0.0159 0.0078 0.0121 0.0147
Gaussian training error 0.22% 0.21% 0.19% 0.22% 0.19% 0.21% 0.20% 0.21% 0.22%
test error 0.99% 1.07% 1.11% 1.13% 0.99% 1.16% 1.09% 1.16% 0.97%
time cost (sec.) 13669 13999 14296 14526 13497 14343 14872 14322 15725
approximation error 0.0209 0.0124 0.0224 0.0231 0.0199 0.0246 0.0448 0.0383 0.0612
arccos0 training error 2.71% 2.70% 2.70% 2.70% 2.70% 2.60% 2.70% 3.02% 2.64%
test error 2.76% 2.91% 2.75% 2.86% 2.73% 2.94% 2.89% 3.00% 2.72%
time cost (sec.) 10577 10266 10501 10558 10595 10807 11235 10330 12231
approximation error 0.0394 0.0104 0.0310 0.0316 0.0259 0.0458 0.0198 0.0369 0.0357
arccos1 training error 0.93% 0.96% 0.94% 1.00% 0.94% 0.98% 0.95% 0.96% 0.93%
test error 1.64% 1.59% 1.52% 1.57% 1.62% 1.27% 1.34% 1.51% 1.62%
time cost (sec.) 9243.7 9170.3 9187.4 8861.6 8870.8 8824.1 9455.3 9188.1 9742.3

[22] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, “Fine-grained analysis of optimization and generalization for overparameterized
two-layer neural networks,” in International Conference on Machine Learning, 2019, pp. 322–332.
[23] Ziwei Ji and Matus Telgarsky, “Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks,” in
International Conference on Learning Representations, 2020, pp. 1–8.
[24] Felix Xinnan Yu, Ananda Theertha Suresh, Krzysztof Choromanski, Daniel Holtmannrice, and Sanjiv Kumar, “Orthogonal random features,” in Advances
in Neural Information Processing Systems, 2016, pp. 1975–1983.
[25] Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney, “Quasi-Monte Carlo feature maps for shift-invariant kernels,” Journal of Machine
Learning Research, vol. 17, no. 1, pp. 4096–4133, 2016.
[26] Tri Dao, Christopher M. De Sa, and Christopher Ré, “Gaussian quadrature for kernel features,” in Advances in neural information processing systems,
2017, pp. 6107–6117.
[27] Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, and Ivan Oseledets, “Quadrature-based features for kernel approximation,” in Advances in Neural
Information Processing Systems, 2018, pp. 9147–9156.
[28] Alaa Saade, Francesco Caltagirone, Igor Carron, Laurent Daudet, Angélique Drémeau, Sylvain Gigan, and Florent Krzakala, “Random projections through
multiple optical scattering: Approximating kernels at the speed of light,” in Proceedings og IEEE International Conference on Acoustics, Speech and Signal
Processing. IEEE, 2016, pp. 6215–6219.
[29] Ruben Ohana, Jonas Wacker, Jonathan Dong, Sébastien Marmin, Florent Krzakala, Maurizio Filippone, and Laurent Daudet, “Kernel computations from
large-scale random features obtained by optical processing units,” arXiv preprint arXiv:1910.09880, 2019.
[30] Danica J. Sutherland and Jeff Schneider, “On the error of random Fourier features,” in Conference on Uncertainty in Artificial Intelligence, 2015, pp.
862–871.
[31] Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic, “Towards a unified analysis of random Fourier features,” in the 36th International Conference
on Machine Learning, 2019, pp. 3905–3914.
[32] Ali Rahimi and Benjamin Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in neural
information processing systems, 2009, pp. 1313–1320.
[33] Fuxin Li, Catalin Ionescu, and Cristian Sminchisescu, “Random Fourier approximations for skewed multiplicative histogram kernels,” in Joint Pattern
Recognition Symposium. Springer, 2010, pp. 262–271.
[34] Purushottam Kar and Harish Karnick, “Random feature maps for dot product kernels,” in International Conference on Artificial Intelligence and Statistics,
2012, pp. 583–591.
[35] Andrea Vedaldi and Andrew Zisserman, “Efficient additive kernels via explicit feature maps,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 34, no. 3, pp. 480–492, 2012.
[36] Quoc Le, Tamás Sarlós, and Alex J. Smola, “FastFood—approximating kernel expansions in loglinear time,” in International Conference on Machine
Learning, 2013, pp. 244–252.
[37] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney, “Quasi-Monte Carlo feature maps for shift-invariant kernels,” in International
Conference on Machine Learning, 2014, pp. 485–493.
[38] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song, “Scalable kernel methods via doubly stochastic gradients,” in
Advances in Neural Information Processing Systems, 2014, pp. 3041–3049.
[39] Jeffrey Pennington, Felix Xinnan X. Yu, and Sanjiv Kumar, “Spherical random features for polynomial kernels,” in Advances in Neural Information
Processing Systems, 2015, pp. 1846–1854.
32

[40] Chang Feng, Qinghua Hu, and Shizhong Liao, “Random feature mapping with signed circulant matrix projection,” in Twenty-Fourth International Joint
Conference on Artificial Intelligence, 2015.
[41] Krzysztof Choromanski and Vikas Sindhwani, “Recycling randomness with structure for sublinear time kernel expansions,” in International Conference on
Machine Learning, 2016, pp. 2502–2510.
[42] Weiwei Shen, Zhihui Yang, and Jun Wang, “Random features for shift-invariant kernels with moment matching,” in Thirty-First AAAI Conference on
Artificial Intelligence, 2017, pp. 2520–2526.
[43] Yueming Lyu, “Spherical structured feature maps for kernel approximation,” in 34th International Conference on Machine Learning. JMLR.org, 2017, pp.
2256–2264.
[44] Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh, “On data-dependent random features for improved generalization in supervised learning,” in
Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 4026–4033.
[45] Jian Zhang, Avner May, Tri Dao, and Christopher Re, “Low-precision random Fourier features for memory-constrained kernel approximation,” in 22nd
International Conference on Artificial Intelligence and Statistics, 2019, pp. 1264–1274.
[46] Raj Agrawal, Trevor Campbell, Jonathan Huggins, and Tamara Broderick, “Data-dependent compression of random features for large-scale kernel
approximation,” in 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1822–1831.
[47] Fanghui Liu, Xiaolin Huang, Yudong Chen, Jie Yang, and Johan A.K. Suykens, “Random Fourier features via fast surrogate leverage weighted sampling,”
in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 4844–4851.
[48] Tamás Erdélyi, Cameron Musco, and Christopher Musco, “Fourier sparse leverage scores and approximate kernel learning,” in Advances in Neural
Information Processing Systems, 2020.
[49] Bharath K. Sriperumbudur and Zoltán Szabó, “Optimal rates for random Fourier features,” in Advances in Neural Information Processing Systems, 2015,
pp. 1144–1152.
[50] Jean Honorio and Yu-Jun Li, “The error probability of random Fourier features is dimensionality independent,” arXiv preprint arXiv:1710.09953, 2017.
[51] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh, “Random Fourier features for kernel ridge
regression: Approximation bounds and statistical guarantees,” in International Conference on Machine Learning, 2017, pp. 253–262.
[52] Francis Bach, “On the equivalence between kernel quadrature rules and random feature expansions,” Journal of Machine Learning Research, vol. 18, no. 1,
pp. 714–751, 2017.
[53] Alessandro Rudi and Lorenzo Rosasco, “Generalization properties of learning with random features,” in Advances in Neural Information Processing
Systems, 2017, pp. 3215–3225.
[54] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir, “Proving the lottery ticket hypothesis: Pruning is all you need,” arXiv preprint
arXiv:2002.00585, 2020.
[55] Salomon Bochner, Harmonic Analysis and the Theory of Probability, Courier Corporation, 2005.
[56] I. J. Schoenberg, “Positive definite functions on spheres,” Duke Mathematical Journal, vol. 9, no. 1, pp. 96–108, 1942.
[57] Alex J. Smola, Zoltan L. Ovari, and Robert C. Williamson, “Regularization with dot-product kernels,” in Advances in Neural Information Processing
Systems, 2001, pp. 308–314.
[58] Claus Müller, Spherical harmonics, vol. 17, Springer, 2006.
[59] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp.
489–501, 2006.
[60] Youngmin Cho and Lawrence K Saul, “Kernel methods for deep learning,” in Advances in Neural Information Processing Systems, 2009, pp. 342–350.
[61] Christopher K.I. Williams, “Computing with infinite networks,” in Advances in Neural Information Processing Systems, 1997, pp. 295–301.
[62] Dan Hendrycks and Kevin Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[63] Amit Daniely, Roy Frostig, and Yoram Singer, “Toward deeper understanding of neural networks: The power of initialization and a dual view on
expressivity,” in Advances In Neural Information Processing Systems, 2016, pp. 2253–2261.
[64] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein, “Deep neural networks as Gaussian
Processes,” in International Conference on Learning Representations, 2018.
[65] Lenaic Chizat, Edouard Oyallon, and Francis Bach, “On lazy training in differentiable programming,” in Advances in Neural Information Processing
Systems, 2019, pp. 2933–2943.
[66] Alberto Bietti and Julien Mairal, “On the inductive bias of neural tangent kernels,” in Advances in Neural Information Processing Systems, 2019, pp.
12873–12884.
[67] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, “Linearized two-layers neural networks in high dimension,” Annals of
Statistics, 2019.
[68] Alberto Bietti and Francis Bach, “Deep equals shallow for ReLU networks in kernel regimes,” in International Conference on Learning Representations,
2021.
[69] Marc G. Genton, “Classes of kernels for machine learning: a statistics perspective,” Journal of Machine Learning Research, vol. 2, pp. 299–312, 2001.
[70] Sami Remes, Markus Heinonen, and Samuel Kaski, “Non-stationary spectral kernels,” in Advances in Neural Information Processing Systems, 2017, pp.
4642–4651.
[71] Jean-Francois Ton, Seth Flaxman, Dino Sejdinovic, and Samir Bhatt, “Spatial mapping with Gaussian processes and nonstationary Fourier features,”
Spatial Statistics, vol. 28, pp. 59–78, 2018.
[72] Akiva M Yaglom, Correlation Theory of Stationary and Related Random Functions, Springer-Verlag, 1987.
[73] Ninh Pham and Rasmus Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in ACM International Conference on Knowledge Discovery
and Data Mining, 2013, pp. 239–247.
[74] Michela Meister, Tamas Sarlos, and David Woodruff, “Tight dimensionality reduction for sketching low degree polynomial kernels,” in Advances in Neural
Information Processing Systems, 2019, pp. 9475–9486.
[75] Haim Avron, Huy Nguyen, and David Woodruff, “Subspace embeddings for the polynomial kernel,” in Advances in neural information processing systems,
2014, pp. 2258–2266.
[76] David P Woodruff and Amir Zandieh, “Near input sparsity time kernel embeddings via adaptive sampling,” in International Conference on Machine
Learning, 2020, pp. 10324–10333.
[77] Fanghui Liu, Lei Shi, Xiaolin Huang, Jie Yang, and Johan A.K. Suykens, “A double-variational Bayesian framework in random Fourier features for
indefinite kernels,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 8, pp. 2965–2979, 2020.
[78] Fanghui Liu, Xiaolin Huang, Yingyi Chen, and Johan A.K. Suykens, “Fast learning in reproducing kernel Kreı̆n spaces via signed measures,” in
International Conference on Artificial Intelligence and Statistics, 2021, pp. 1–11.
[79] Ping Li, “Linearized GMM kernels and normalized random Fourier features,” in 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2017, pp. 315–324.
[80] Krzysztof M. Choromanski, Mark Rowland, and Adrian Weller, “The unreasonable effectiveness of structured random orthogonal embeddings,” in
Advances in Neural Information Processing Systems, 2017, pp. 219–228.
[81] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco, “On fast leverage score sampling and optimal learning,” in Advances in
Neural Information Processing Systems, 2018, pp. 5672–5682.
[82] Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabás Póczos, “Data-driven random Fourier features using Stein effect,” in 26th International
Joint Conference on Artificial Intelligence, 2017, pp. 1497–1503.
33

[83] Aman Sinha and John C. Duchi, “Learning kernels with random features,” in Proceedins of Advances in Neural Information Processing Systems, 2016, pp.
1298–1306.
[84] Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, and Barnabas Poczos, “Implicit kernel learning,” in International Conference on
Artificial Intelligence and Statistics, 2019, pp. 2007–2016.
[85] Felix X. Yu, Sanjiv Kumar, Henry Rowley, and Shih Fu Chang, “Compact nonlinear maps and circulant extensions,” arXiv preprint arXiv:1503.03893,
2015.
[86] Brian Bullins, Cyril Zhang, and Yi Zhang, “Not-so-random features,” in International Conference on Learning Representations, 2018.
[87] Andrew Gordon Wilson and Ryan Prescott Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in International Conference on
Machine Learning, 2013, pp. 1067–1075.
[88] Zichao Yang, Andrew Wilson, Alex J. Smola, and Le Song, “À la carte–learning fast kernels,” in Artificial Intelligence and Statistics, 2015, pp. 1098–1106.
[89] Zheyang Shen, Markus Heinonen, and Samuel Kaski, “Harmonizable mixture kernels with variational Fourier features,” in International Conference on
Artificial Intelligence and Statistics. PMLR, 2019.
[90] Junier B. Oliva, Avinava Dubey, Andrew G. Wilson, Barnabás Póczos, Jeff Schneider, and Eric P. Xing, “Bayesian nonparametric kernel learning,” in
International Conference on Artificial Intelligence and Statistics, 2016, pp. 1078–1086.
[91] Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Francois Fagan, Cedric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamas Sarlos, and Jamal
Atif, “Structured adaptive and random spinners for fast machine learning computations,” in Artificial Intelligence and Statistics, 2017, pp. 1020–1029.
[92] Harald Niederreiter, Random number generation and quasi-Monte Carlo methods, vol. 63, SIAM, 1992.
[93] Tianbao Yang, Yu Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi Hua Zhou, “Nyström method vs random Fourier features: a theoretical and empirical
comparison,” in Advances in Neural Information Processing Systems, 2012, pp. 476–484.
[94] Bo Xie, Yingyu Liang, and Le Song, “Scale up nonlinear component analysis with doubly stochastic gradients,” in Advances in Neural Information
Processing Systems, 2015, pp. 2341–2349.
[95] Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X. Ling, “Triply stochastic gradients on multiple kernel learning.,” in Conference on Uncertainty
in Artificial Intelligence, 2017, pp. 1–9.
[96] Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller, “Unifying orthogonal Monte Carlo methods,” in International Conference on
Machine Learning, 2019, pp. 1203–1212.
[97] Krzysztof Choromanski, Mark Rowland, Tamás Sarlós, Vikas Sindhwani, Richard Turner, and Adrian Weller, “The geometry of random features,” in
International Conference on Artificial Intelligence and Statistics, 2018, pp. 1–9.
[98] Xiaoyun Li and Ping Li, “Quantization algorithms for random fourier features,” in International Conference on Machine Learning, 2021, pp. 6369–6380.
[99] Johann S. Brauchart and Peter J. Grabner, “Distributing many points on spheres: minimal energy and designs,” Journal of Complexity, vol. 31, no. 3, pp.
293–326, 2015.
[100] Yueming LYU, Yuan Yuan, and Ivor Tsang, “Subgroup-based rank-1 lattice quasi-monte carlo,” in Advances in Neural Information Processing Systems,
2020.
[101] Gwynne Evans, Practical numerical integration, Wiley New York, 1993.
[102] Alan Genz and John Monahan, “Stochastic integration rules for infinite regions,” SIAM Journal on Scientific Computing, vol. 19, no. 2, pp. 426–439, 1998.
[103] Alan Genz and John Monahan, “A stochastic algorithm for high-dimensional integrals over unbounded regions with gaussian weight,” Journal of
Computational and Applied Mathematics, vol. 112, no. 1-2, pp. 71–81, 1999.
[104] Florian Heiss and Viktor Winschel, “Likelihood approximation by numerical integration on sparse grids,” Journal of Econometrics, vol. 144, no. 1, pp.
62–80, 2008.
[105] Ayoub Belhadji, Rémi Bardenet, and Pierre Chainais, “Kernel quadrature with dpps,” in Advances in Neural Information Processing Systems, 2019, pp.
1–11.
[106] Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A.K. Suykens, “Towards a unified quadrature framework for large-scale kernel machines,” arXiv
preprint arXiv:2011.01668, 2020.
[107] François-Xavier Briol, Chris J Oates, Jon Cockayne, Wilson Ye Chen, and Mark Girolami, “On the sampling problem for kernel quadrature,” in
International Conference on Machine Learning, 2017, pp. 586–595.
[108] Bertrand Gauthier and Johan A.K. Suykens, “Optimal quadrature-sparsification for integral operator approximation,” SIAM Journal on Scientific Computing,
vol. 40, no. 5, pp. A3636–A3674, 2018.
[109] Yinsong Wang and Shahin Shahrampour, “A general scoring rule for randomized kernel approximation with application to canonical correlation analysis,”
arXiv preprint arXiv:1910.05384, 2019.
[110] Tong Zhang, “Learning bounds for kernel regression using effective data dimensionality,” Neural Computation, vol. 17, no. 9, pp. 2077–2098, 2005.
[111] Francis Bach, “Sharp analysis of low-rank kernel matrix approximations,” in Conference on Learning Theory, 2013, pp. 185–209.
[112] Hayata Yamasaki, Sathyawageeswar Subramanian, Sho Sonoda, and Masato Koashi, “Fast quantum algorithm for learning with optimized random features,”
in Advances in Neural Information Processing Systems, 2020, pp. 1–10.
[113] Ahmed Alaoui and Michael W Mahoney, “Fast randomized kernel ridge regression with statistical guarantees,” in Advances in Neural Information
Processing Systems, 2015, pp. 775–783.
[114] Daniele Calandriello, Alessandro Lazaric, and Michal Valko, “Distributed adaptive sampling for kernel matrix approximation,” in Artificial Intelligence and
Statistics, 2017, pp. 1421–1429.
[115] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh, “Two-stage learning kernel algorithms,” in International Conference on Machine Learning,
2010, pp. 239–246.
[116] Trevor Campbell and Tamara Broderick, “Bayesian coreset construction via greedy iterative geodesic ascent,” in International Conference on Machine
Learning, 2018, pp. 698–706.
[117] Trevor Campbell and Tamara Broderick, “Automated scalable Bayesian inference via Hilbert coresets,” Journal of Machine Learning Research, vol. 20, no.
1, pp. 551–588, 2019.
[118] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis Decoste, “Compact random feature maps,” in International Conference on Machine Learning, 2014,
pp. 19–27.
[119] Ali Rahimi and Benjamin Recht, “Uniform approximation of functions with random bases,” in Annual Allerton Conference on Communication, Control,
and Computing. IEEE, 2008, pp. 555–561.
[120] Mina Ghashami, Daniel J. Perry, and Jeff Phillips, “Streaming kernel principal component analysis,” in Artificial intelligence and statistics, 2016, pp.
1365–1374.
[121] Bharath Sriperumbudur and Nicholas Sterge, “Statistical consistency of kernel PCA with random features,” arXiv√preprint arXiv:1706.06296, 2017.
[122] Enayat Ullah, Poorya Mianjy, Teodor Vanislavov Marinov, and Raman Arora, “Streaming kernel PCA with Õ( n) random features,” in Advances in
Neural Information Processing Systems, 2018, pp. 7311–7321.
[123] Felipe Cucker and Dingxuan Zhou, Learning theory: an approximation theory viewpoint, vol. 24, Cambridge University Press, 2007.
[124] Ingo Steinwart and Christmann Andreas, Support Vector Machines, Springer Science and Business Media, 2008.
[125] Gilles Blanchard and Nicole Krämer, “Optimal learning rates for kernel conjugate gradient regression,” in Advances in Neural Information Processing
Systems, 2010, pp. 226–234.
[126] Andrea Caponnetto and Ernesto De Vito, “Optimal rates for the regularized least-squares algorithm,” Foundations of Computational Mathematics, vol. 7,
no. 3, pp. 331–368, 2007.
34

[127] John Shawe-Taylor, Chris Williams, Nello Cristianini, and Jaz Kandola, “On the eigenspectrum of the gram matrix and its relationship to the operator
eigenspectrum,” in International Conference on Algorithmic Learning Theory. Springer, 2002, pp. 23–40.
[128] Steve Smale and Ding-Xuan Zhou, “Learning theory estimates via integral operators and their approximations,” Constructive Approximation, vol. 26, no. 2,
pp. 153–172, 2007.
[129] Zheng-Chu Guo and Lei Shi, “Optimal rates for coefficient-based regularized regression,” Applied and Computational Harmonic Analysis, vol. 47, no. 3,
pp. 662–701, 2019.
[130] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou, “Distributed learning with regularized least squares,” Journal of Machine Learning Research, vol. 18, no. 1,
pp. 3202–3232, 2017.
[131] Vladimir Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, vol. 2033, Springer Science & Business Media,
2011.
[132] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson, “Local rademacher complexities,” Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005.
[133] Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic, “Towards a unified analysis of random fourier features,” Journal of Machine Learning
Research, vol. 22, no. 108, pp. 1–51, 2021.
[134] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco, “Learning with SGD and random features,” in Advances in Neural Information Processing
Systems, 2018, pp. 10212–10223.
[135] Ingo Steinwart and Clint Scovel, “Fast rates for support vector machines using Gaussian kernels,” Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007.
[136] Shusen Wang, “Simple and almost assumption-free out-of-sample bound for random feature mapping,” arXiv preprint arXiv:1909.11207, 2019.
[137] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE.
[138] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Technical report, University of Toronto, 2009.
[139] Gaëlle Loosli, Stéphane Canu, and Léon Bottou, “Training invariant support vector machines using selective sampling,” Large scale kernel machines, vol.
2, 2007.
[140] Sanjeev Arora, Simon S Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli Yu, “Harnessing the power of infinitely wide deep nets on
small-data tasks,” in International Conference on Learning Representations, 2020.
[141] Sergey Ioffe and Christian Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in International
Conference on Machine Learning, 2015, pp. 448–456.
[142] Jia Deng, Wei Dong, Richard Socher, Li Jia Li, Kai Li, and Fei Fei Li, “Imagenet: A large-scale hierarchical image database,” in Proceedins of the IEEE
Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[143] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” arXiv
preprint arXiv:1903.08560, 2019.
[144] Song Mei and Andrea Montanari, “The generalization error of random features regression: Precise asymptotics and double descent curve,” arXiv preprint
arXiv:1908.05355, 2019.
[145] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, “On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels,” in
Annual Conference on Learning Theory, 2019, pp. 1–32.
[146] Fanghui Liu, Zhenyu Liao, and Johan A.K. Suykens, “Kernel regression in high dimensions: Refined analysis beyond double descent,” in International
Conference on Artificial Intelligence and Statistics, 2021, pp. 1–11.
[147] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever, “Deep double descent: Where bigger models and more data
hurt,” in International Conference on Learning Representations, 2019.
[148] Mikhail Belkin, Siyuan Ma, and Soumik Mandal, “To understand deep learning we need to understand kernel learning,” in International Conference on
Machine Learning, 2018, pp. 541–549.
[149] Felipe Cucker and Steve Smale, “On the mathematical foundations of learning,” Bulletin of the American mathematical society, vol. 39, no. 1, pp. 1–49,
2002.
[150] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang, “Learning and generalization in overparameterized neural networks, going beyond two layers,” in
Advances in neural information processing systems, 2019, pp. 6158–6169.
[151] Terence Tao, Topics in random matrix theory, American Mathematical Society, 2012.
[152] Jeffrey Pennington and Pratik Worah, “Nonlinear random matrix theory for deep learning,” in Advances in Neural Information Processing Systems, 2017,
pp. 2634–2643.
[153] Zhenyu Liao and Romain Couillet, “On the spectrum of random features maps of high dimensional data,” in International Conference on Machine
Learning, 2018, pp. 3063–3071.
[154] Zhenyu Liao, Romain Couillet, and Michael Mahoney, “A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase
transition, and the corresponding double descent,” in Neural Information Processing Systems, 2020.
[155] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Denny Wu, and Tianzong Zhang, “Generalization of two-layer neural networks: an asymptotic viewpoint,” in
International Conference on Learning Representations, 2020, pp. 1–8.
[156] Stéphane d’Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala, “Double trouble in double descent: Bias and variance(s) in the lazy regime,” arXiv
preprint arXiv:2003.01054, 2020.
[157] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová, “Generalisation error in learning with random features and the
hidden manifold model,” in International Conference on Machine Learning, 2020, pp. 3452–3462.
[158] Ben Adlam and Jeffrey Pennington, “Understanding double descent requires a fine-grained bias-variance decomposition,” in Advances in neural information
processing systems, 2020.
[159] Oussama Dhifallah and Yue M Lu, “A precise performance analysis of learning with random features,” arXiv preprint arXiv:2008.11904, 2020.
[160] Hong Hu and Yue M Lu, “Universality laws for high-dimensional learning with random features,” arXiv preprint arXiv:2009.07669, 2020.
[161] Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, and Franck Gabriel, “Implicit regularization of random feature models,” in International
Conference on Machine Learning, 2020, pp. 4631–4640.
[162] Mikhail Belkin, Daniel Hsu, and Ji Xu, “Two models of double descent for weak features,” SIAM Journal on Mathematics of Data Science, vol. 2, no. 4,
pp. 1167–1180, 2020.
[163] Jason W Rocks and Pankaj Mehta, “Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models,” arXiv preprint
arXiv:2010.13933, 2020.
[164] Licong Lin and Edgar Dobriban, “What causes the test error? going beyond bias-variance via anova,” arXiv preprint arXiv:2010.05170, 2020.
[165] Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications,
vol. 9, World Scientific Publishing Company, 1987.
[166] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, “Regularized linear regression: A precise analysis of the estimation error,” in Conference on
Learning Theory, 2015, pp. 1683–1709.
[167] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan, “The generalization error of max-margin linear classifiers: High-dimensional asymptotics in
the overparametrized regime,” arXiv preprint arXiv:1911.01544, 2019.
[168] Tengyuan Liang and Pragya Sur, “A precise high-dimensional asymptotic theory for boosting and min-`1 -norm interpolated classifiers,” arXiv preprint
arXiv:2002.01586, 2020.
[169] Adel Javanmard, Mahdi Soltanolkotabi, and Hamed Hassani, “Precise tradeoffs in adversarial training for linear regression,” in Conference on Learning
Theory, 2020, pp. 2034–2078.
35

[170] Cosme Louart, Zhenyu Liao, and Romain Couillet, “A random matrix approach to neural networks,” The Annals of Applied Probability, vol. 28, no. 2, pp.
1190–1248, 2018.
[171] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, “Generalization error of random features and kernel methods: hypercontractivity and kernel
matrix concentration,” arXiv preprint arXiv:2101.10588, 2021.
[172] Gilad Yehudai and Ohad Shamir, “On the power and limitations of random features for understanding neural networks,” in Advances in Neural Information
Processing Systems, 2019, pp. 6594–6604.
[173] Yitong Sun, Anna Gilbert, and Ambuj Tewari, “On the approximation properties of random ReLU features,” arXiv preprint arXiv:1810.04374, 2018.
[174] Jonathan Frankle and Michael Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in International Conference on Learning
Representations, 2019.
[175] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep
architectures,” arXiv preprint arXiv:1607.03250, 2016.
[176] Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, and Alessandro Rudi, “Kernel methods through the roof: Handling billions of points efficiently,” in
Advances in Neural Information Processing Systems, 2020.
[177] Lin Chen, Yifei Min, Mikhail Belkin, and Amin Karbasi, “Multiple descent: Design your own generalization curve,” arXiv preprint arXiv:2008.01036,
2020.
[178] Denny Wu and Ji Xu, “On the optimal weighted `2 regularization in overparameterized linear regression,” in Advances in Neural Information Processing
Systems, 2020, pp. 1–11.
[179] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez, “The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the
implicit ridge regularization,” Journal of Machine Learning Research, vol. 21, no. 169, pp. 1–16, 2020.
[180] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, “Limitations of lazy training of two-layers neural network,” in Advances in
Neural Information Processing Systems, 2019, pp. 9108–9118.

Linear Algebra
67% (3)
Linear Algebra
395 pages
Kernel-Based Approximation Methods Using MATLAB
0% (1)
Kernel-Based Approximation Methods Using MATLAB
9 pages
Cambridge IA Linear Algebra
No ratings yet
Cambridge IA Linear Algebra
148 pages
Introduction To The Theory of Functional Differential Equations Methods and Applications Contemporary Mathematics and Its Applications Book Series)
100% (3)
Introduction To The Theory of Functional Differential Equations Methods and Applications Contemporary Mathematics and Its Applications Book Series)
324 pages
Topic 054 Linear Operators: Operator: in The Case of Vector Spaces and in Particular Normed
No ratings yet
Topic 054 Linear Operators: Operator: in The Case of Vector Spaces and in Particular Normed
28 pages
Finalexam 2
No ratings yet
Finalexam 2
172 pages
Part IA - Vectors and Matrices: Definitions
No ratings yet
Part IA - Vectors and Matrices: Definitions
19 pages
NEP 2020 Mathematics
No ratings yet
NEP 2020 Mathematics
23 pages
Machine Learning With Kernel Methods
No ratings yet
Machine Learning With Kernel Methods
760 pages
A Pari/GP Tutorial: by Robert B. Ash, Jan. 2007
No ratings yet
A Pari/GP Tutorial: by Robert B. Ash, Jan. 2007
20 pages
CS Iii - Iv - Sem - Syllabus
No ratings yet
CS Iii - Iv - Sem - Syllabus
63 pages
The Core Ideas in Our Teaching: Gilbert Strang
No ratings yet
The Core Ideas in Our Teaching: Gilbert Strang
3 pages
Orthogonality and Vector Spaces
No ratings yet
Orthogonality and Vector Spaces
18 pages
Mva - Slides Machine Learning With Kernel Methods
No ratings yet
Mva - Slides Machine Learning With Kernel Methods
644 pages
Reproducing Kernel Banach Spaces For Machine Learning: Haizhang Zhang Yuesheng Xu
No ratings yet
Reproducing Kernel Banach Spaces For Machine Learning: Haizhang Zhang Yuesheng Xu
35 pages
Vector-Valued Reproducing Kernel Hilbert Spaces: With Applications To Function Extension and Image Colorization
No ratings yet
Vector-Valued Reproducing Kernel Hilbert Spaces: With Applications To Function Extension and Image Colorization
71 pages
MT3095 Vle
100% (2)
MT3095 Vle
263 pages
Lecture4 introToRKHS
No ratings yet
Lecture4 introToRKHS
33 pages
2021 UNAS REFER Rafi Yon Saputra 173112706420242 Kernel Primer
No ratings yet
2021 UNAS REFER Rafi Yon Saputra 173112706420242 Kernel Primer
65 pages
Linear Transformations
No ratings yet
Linear Transformations
8 pages
Probability Product Kernels: Tony Jebara Risi Kondor Andrew Howard
No ratings yet
Probability Product Kernels: Tony Jebara Risi Kondor Andrew Howard
26 pages
The Subject Math GRE Test: Linear Algebra: Oscar Vega Ovega@csufresno - Edu
No ratings yet
The Subject Math GRE Test: Linear Algebra: Oscar Vega Ovega@csufresno - Edu
9 pages
Kernels and Distances For Structured Data
No ratings yet
Kernels and Distances For Structured Data
28 pages
COS-MATH-241-Linear Algebra Spring 2020 - Midterm Exam 2 (20%)
No ratings yet
COS-MATH-241-Linear Algebra Spring 2020 - Midterm Exam 2 (20%)
2 pages
Multi-Model Deep Neural Network Based Features Extraction and Optimal Selection Approach For Skin Lesion Classification
No ratings yet
Multi-Model Deep Neural Network Based Features Extraction and Optimal Selection Approach For Skin Lesion Classification
7 pages
Second Sem
No ratings yet
Second Sem
18 pages
Kernal Methods Machine Learning
No ratings yet
Kernal Methods Machine Learning
53 pages
Reproducing Kernel Hilbert Space, Mercer's Theorem, Eigenfunctions, Nystr Om Method, and Use of Kernels in Machine Learning: Tutorial and Survey
No ratings yet
Reproducing Kernel Hilbert Space, Mercer's Theorem, Eigenfunctions, Nystr Om Method, and Use of Kernels in Machine Learning: Tutorial and Survey
31 pages
Introduction To Kernels: Max Welling
No ratings yet
Introduction To Kernels: Max Welling
16 pages
Syllabus M.A. M.sc. B.A and B.SC Mathematics
No ratings yet
Syllabus M.A. M.sc. B.A and B.SC Mathematics
26 pages
Handbook 2011-2012
No ratings yet
Handbook 2011-2012
41 pages
Quantum Machine Learning in Feature Hilbert Spaces: Maria@xanadu - Ai
No ratings yet
Quantum Machine Learning in Feature Hilbert Spaces: Maria@xanadu - Ai
12 pages
Adaptive Sharpen
No ratings yet
Adaptive Sharpen
4 pages
Memory Efficient Kernel Approximation: . This Work Was Done Before Joining Google
No ratings yet
Memory Efficient Kernel Approximation: . This Work Was Done Before Joining Google
32 pages
Class04 Feature+Kernels
No ratings yet
Class04 Feature+Kernels
35 pages
Ds 11
No ratings yet
Ds 11
21 pages
1749 Simplex Random Features
No ratings yet
1749 Simplex Random Features
25 pages
Tutorial 2
No ratings yet
Tutorial 2
3 pages
07 Kernels
No ratings yet
07 Kernels
6 pages
High Dimensional Representation
No ratings yet
High Dimensional Representation
33 pages
4c Kernels
No ratings yet
4c Kernels
31 pages
Kernel Methods For General Pattern Analysis PDF
No ratings yet
Kernel Methods For General Pattern Analysis PDF
77 pages
Exercices Kernel Trick
No ratings yet
Exercices Kernel Trick
24 pages
Advanced Matrix Theory and Linear Algebra For Engineers Video Course - Syllabus
No ratings yet
Advanced Matrix Theory and Linear Algebra For Engineers Video Course - Syllabus
3 pages
Btech Electrical 4th Year 1624713084
No ratings yet
Btech Electrical 4th Year 1624713084
37 pages
Period Kernal Approx
No ratings yet
Period Kernal Approx
11 pages
Textbook Elementary Linear Algebra 1St Edition James R Kirkwood Ebook All Chapter PDF
100% (25)
Textbook Elementary Linear Algebra 1St Edition James R Kirkwood Ebook All Chapter PDF
53 pages
Kernels Regularization and Differential Equations
No ratings yet
Kernels Regularization and Differential Equations
16 pages
Cours2 ML
No ratings yet
Cours2 ML
21 pages
Question Bank2
No ratings yet
Question Bank2
8 pages
cs229 Notes3
No ratings yet
cs229 Notes3
30 pages
Scikit Learn Org Stable Modules Kernel - Approximation HTML
No ratings yet
Scikit Learn Org Stable Modules Kernel - Approximation HTML
3 pages
Assignment 1
No ratings yet
Assignment 1
6 pages
Divide and Conquer Kernel Ridge Regression: University of California, Berkeley University of California, Berkeley
No ratings yet
Divide and Conquer Kernel Ridge Regression: University of California, Berkeley University of California, Berkeley
26 pages
hw5 Kernel Trick 2021
No ratings yet
hw5 Kernel Trick 2021
4 pages
Kernel Nearest-Neighbor Algorithm
No ratings yet
Kernel Nearest-Neighbor Algorithm
10 pages
Kernel Methods
No ratings yet
Kernel Methods
6 pages
Kernel Method
No ratings yet
Kernel Method
5 pages
Reproducing Kernel Hilbert - Module and Kernel Mean Embeddings
No ratings yet
Reproducing Kernel Hilbert - Module and Kernel Mean Embeddings
56 pages
Learning Multidimensional Fourier Series With Tensor Trains
No ratings yet
Learning Multidimensional Fourier Series With Tensor Trains
6 pages
05 Kernel
No ratings yet
05 Kernel
24 pages
28.9 - Domain Specific Kernels - mp4
No ratings yet
28.9 - Domain Specific Kernels - mp4
2 pages
Lect2 08web
No ratings yet
Lect2 08web
16 pages
Maths LAB Programs BMATS201
No ratings yet
Maths LAB Programs BMATS201
10 pages
03 - Kernelization
No ratings yet
03 - Kernelization
32 pages
DL Unit 1
No ratings yet
DL Unit 1
20 pages
Kernel Models 1233
No ratings yet
Kernel Models 1233
56 pages
Análise Espacial Com Regressão Linear e Kernel
No ratings yet
Análise Espacial Com Regressão Linear e Kernel
12 pages
Grauman Darrell Iccv05
No ratings yet
Grauman Darrell Iccv05
8 pages
Vahid
No ratings yet
Vahid
18 pages
Poly Kernel
No ratings yet
Poly Kernel
6 pages
Kernel Methods in Machine Learning
No ratings yet
Kernel Methods in Machine Learning
53 pages
On The Optimality of Kernel-Embedding Based Goodness-of-Fit Tests
No ratings yet
On The Optimality of Kernel-Embedding Based Goodness-of-Fit Tests
45 pages
ML Mod 4
No ratings yet
ML Mod 4
26 pages
Batlle Et Al. - 2023 - Kernel Methods Are Competitive For Operator Learning
No ratings yet
Batlle Et Al. - 2023 - Kernel Methods Are Competitive For Operator Learning
36 pages
Lecture 05
No ratings yet
Lecture 05
49 pages
S2 Group A
No ratings yet
S2 Group A
12 pages
Kernel Methods in Machine Learning
No ratings yet
Kernel Methods in Machine Learning
3 pages
Intro&NP Stat
No ratings yet
Intro&NP Stat
122 pages
Syllabus-Math 120
No ratings yet
Syllabus-Math 120
1 page
General Graph Random Features
No ratings yet
General Graph Random Features
18 pages
SCH Smo 03 C
No ratings yet
SCH Smo 03 C
24 pages
Lec 16
No ratings yet
Lec 16
23 pages
Kernel Computations From Large-Scale Random Features - Obtained - by - Optical - Processing - Units by Ruben Ohana ICASSP 2020
No ratings yet
Kernel Computations From Large-Scale Random Features - Obtained - by - Optical - Processing - Units by Ruben Ohana ICASSP 2020
5 pages
Random Fourier Features - Random Walks
No ratings yet
Random Fourier Features - Random Walks
10 pages
Approximation of Functionals On Korobov Spaces
No ratings yet
Approximation of Functionals On Korobov Spaces
19 pages
A Distributed Framework For Trimmed Kernel K-Means Clustering
No ratings yet
A Distributed Framework For Trimmed Kernel K-Means Clustering
14 pages
Spatio-Temporal Prediction Via Operator-Valued RKHS and Koopman Approximation
No ratings yet
Spatio-Temporal Prediction Via Operator-Valued RKHS and Koopman Approximation
25 pages
Efficient Algorithms and Structures with Heaps: Definitive Reference for Developers and Engineers
From Everand
Efficient Algorithms and Structures with Heaps: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Nonlinear Transformations of Random Processes
From Everand
Nonlinear Transformations of Random Processes
Ralph Deutsch
No ratings yet

Liu Et Al. - 2021 - Random Features For Kernel Approximation A Survey On Algorithms, Theory, and Beyond

Uploaded by

Liu Et Al. - 2021 - Random Features For Kernel Approximation A Survey On Algorithms, Theory, and Beyond

Uploaded by

1

Random Features for Kernel Approximation: A

Index Terms—random features, kernel approximation, generalization properties, over-parameterized models

RFF spawns a new direction for kernel approximation, and Table 1

poly kernel [34], addtive kernel [35]

kk − k̃k∞ : kk − k̃k∞ : [30], [49]

∆-spectral approximation and empirical risk: [51]

excess risk: ω ∼ q(·) [52]

SVM: ≤ Ω(n) [11]

KRR, SVM: ω ∼ p(·), ω ∼ q(·) [31] (best paper

RFF in DNNs: double descent [20],

Figure 2. A taxonomy of representative random features based algorithms.

which leads to the explicit feature mapping

where ςp2 = Ep [ω> ω] = tr ∇2 k(0) ∈ O(d), and Cd :=

be rewritten as [50] Theorem 5.4] present a non-asymptotic comparison result between

6. A more general condition (r > 0) is often considered in approximation

There is an abnormal but common experiment phenomenon Table 6

adequate in SSF, additional points (i.e., a larger s) may have a small

6.2.2 Classification results on MNIST and CIFAR10

compared to general arbitrary deep networks. This is a potentially

[156] expected excess risk N (0, Id ) hx, ζi ReLU N (0, 1) 3 refined 2

[158, Theorem 1] generalization error N (0, Id ) hx, ζi normalized N (0, 1) 3 refined 2

[154, Theorem 3] generalization error general -3 [cos(·), sin(·)] N (0, 1) 3 %&

k(x, X)(K + nλI)−1 E2

k(x, X)(K + nλI)−1 E1

We observe that, an invertible matrix K

B.1 Results on Gaussian kernels

B.2 Results on Arc-cosine kernels

(a) letter (b) ijcnn1

(c) covtype (d) cod-RNA

(e) EEG (f) magic04

(g) skin (h) a8a

B.3 Results on Polynomial kernels

B.4 Results on the MNIST-8M dataset

(a) letter (b) ijcnn1

(c) covtype (d) cod-RNA

(e) EEG (f) magic04

(g) skin (h) a8a

(a) letter (b) ijcnn1

(c) covtype (d) cod-RNA

(e) EEG (f) magic04

(g) skin (h) a8a

Figure 13. Results on eight datasets across polynomial kernels.

You might also like

where ςp2 = Ep [ω> ω] = tr ∇2 k(0) ∈ O(d), and Cd :=